Abstract

In this paper, we analyze a collaborative filter that answers the simple question: What is popular amongst your
"friends"? While this basic principle seems to be prevalent in many practical implementations, there does not appear
to be much theoretical analysis of its performance; we partly fill this gap. While recent works on this
topic, such as the low-rank matrix completion literature, consider the probability of error in recovering the entire rating
matrix, we consider the probability of an error in an individual recommendation (bit error rate (BER)). For a mathematical
model introduced in [1], [2], we identify three regimes of operation for our algorithm (named Popularity Amongst
Friends (PAF)) in the limit as the matrix size grows to infinity. In a regime characterized by a large number of samples
and small degrees of freedom (defined precisely for the model in the paper), the asymptotic BER is zero; in a regime
characterized by a large number of samples and large degrees of freedom, the asymptotic BER is bounded away from
0 and 1/2 (and is identified exactly except for a special case); and in a regime characterized by a small number of
samples, the algorithm fails. We then compare these results with the performance of the optimal recommender. We
also present numerical results for the MovieLens and Netflix datasets. We discuss the empirical performance in light
of our theoretical results and compare with an approach ([3]) based on low-rank matrix completion.

I. Introduction 

Recommendation systems suggest relevant content to users based on their previous choices. For example, it is
common to predict user-item ratings based on available ratings and recommend items based on the predicted values
(see [4]). In the collaborative filtering (CF) approach to recommender systems [5], information about a group of
users is used to make recommendations to an individual user. There are two popular classes of CF techniques: a)
neighborhood based methods ([6], [7], [8], [9]), and b) latent factor models [10]. Neighborhood based methods
compute similarities amongst the users (and/or amongst the items), and use information about a set of "similar"
users (and/or "similar" items) to make recommendations. On the other hand, the latent factor models assume that
the entire user-item rating matrix is described by a small number of parameters, which are then estimated from
available data. For example, the low-rank matrix model in [3], [11] is an example of this class. In the remainder
of this section, we outline our goals in the context of existing works, and briefly describe the nature of our results.

A. Prior Work and Our Goals 

Recently there has been a lot of interest in obtaining fundamental limits on the number of samples needed to
recover a low-rank matrix with high probability ([12], [3], [13], [11]). Most of these methods try to find a matrix
with the lowest possible rank that agrees with the observed samples. This is reminiscent of compressed sensing, where
one tries to find the sparsest vector that satisfies certain affine constraints [14], [15]. In another model ([1]),
the rating matrix is assumed to be obtained from a block constant matrix by applying unknown row and column
permutations, a noisy discrete memoryless channel representing noisy user behavior, and an erasure channel denoting
missing entries. Instead of matrix completion, the goal for such a model is to estimate the underlying "noiseless"
matrix, and the performance is dictated by the cluster size (the size of the block of constancy). For their respective
models, the above listed works derive a threshold result: If the number of degrees of freedom (defined appropriately
for the model) is larger than a threshold, then error free recovery is not possible, but otherwise, there is a polynomial
time algorithm that recovers the entire matrix with high probability. Since empirical results ( llT6ll . Q, 113) suggest
that perfect recovery of the entire matrix might not be possible in practice, it is natural to seek a finer analysis
in the regime where perfect recovery is not possible. In practice, we need not predict all the missing ratings; it
suffices to recommend a few items with high ratings. With this in mind, in this paper we recommend one item
to each user, and consider the probability that a given recommendation is incorrect as the performance metric.
Using this metric, we seek to develop a theoretical understanding of a basic principle that is prevalent in practical
systems. This basic principle recommends items to an individual based on their popularity amongst similar users
and is the main motivation for neighborhood based methods ([6], [7], [8]). This principle is also similar to the
k-nearest neighbors (KNN) algorithms for classification [18]. In this paper, we analyze a collaborative filter based
on this principle for the data model proposed in [1], [2]. Further, we also evaluate and discuss performance on the
MovieLens and Netflix datasets in light of our theoretical results and earlier works inspired by low-rank matrix
completion/approximation. Below, we summarize our main results.



B. Organization and Summary of Results 

Typical rating data belongs to a finite alphabet. In this paper, we consider a binary alphabet ('like' or 'dislike'),
which is of special interest (see Section III-A for a discussion of this point). In Section II-A we describe our
algorithm - named Popularity Amongst Friends (PAF) - for a binary rating matrix. In Section II-B we show some
experimental results on the MovieLens and Netflix datasets. We compare with OptSpace [3], which is motivated by
the low-rank completion problem and is a representative of this class of works. The empirical results reveal that the
PAF algorithm has a BER similar to that of OptSpace. We also present results for different values of the algorithm
parameter (size of the list of friends). Having demonstrated the algorithm performance on real data, in Section III we
turn to its theoretical analysis. We consider the data model proposed in [1], [2].

Fig. 1. A schematic view of the main results. The three shaded regions correspond to the three different parts of the theorem. Only the
asymptotic behaviour is presented in the figure.

Summary of the data model: To motivate this model, consider an ideal situation where users and items are
clustered, and users within a cluster rate items within a cluster by the same value. The rating matrix in this ideal
situation (denoted by X) is then a block constant matrix. The observations are obtained from X by passing its
entries through a binary symmetric channel (BSC) with parameter $p$ (defined in Section III-A), and an erasure
channel with erasure probability $\epsilon$ (defined in Section III-A). Moreover, the row and column clusters are unknown.
The block constant model captures the fact that similar users rate similar items similarly, and the unknown row
(column) clusters represent the fact that the sets of similar users (items) are not known. The erasures represent
missing data, while the BSC represents the noisy behavior of the users. This model is described in detail in Section
III-A. In this paper, we present a detailed analysis of PAF under this model of the underlying data.



To give an outline of our results, suppose that the rating matrix is of size $n \times n$, the erasure probability is
$\epsilon = 1 - c/n^{\alpha}$ for $c > 0$, $\alpha \in [0, 1]$, and the BSC error probability is $p$. We note that $\alpha$ controls the rate at which the
erasure probability approaches 1. This rate plays a crucial role in determining the performance of PAF. Suppose the
rows, as well as the columns, are clustered with each cluster having size $k$. We identify three different performance
regimes, which are illustrated in Fig. 1, in the limit as $n \to \infty$.

• When $\alpha \in [0, 1/2)$, if the cluster size ($k$) is greater than $n^{\alpha - \gamma_n}$ where $\gamma_n \to 0$, then the BER approaches
0 (Phase I of Fig. 1). This result is stated in Theorem 1 of Section III-B.

• When $\alpha \in [0, 1/2)$, if the cluster size ($k$) is less than $n^{\alpha - \gamma}$, $\gamma > 0$, the BER is bounded away from zero and a
lower bound is obtained in terms of the BSC error probability and $\gamma$ (Phase II of Fig. 1). This result is stated
in Theorem 2 of Section III-B. Further, in Theorem 2 we also identify the exact limiting BER (except for
some special cases of $\gamma$) and also the optimal parameter for PAF.

• For $\alpha > 1/2$, the BER always approaches 1/2 (Phase III of Fig. 1). This result is stated in Theorem 3 of
Section III-B.

• We then study a lower bound on the error probability of any recommender for this model, and compare this with the
performance of PAF. We state this result in Theorem 4 of Section III-B.

The main results are proven in Section IV, Section V and Section VI, followed by a conclusion in Section VIII.
We present the proofs of several related lemmas in the Appendix.



II. The Algorithm and its Performance on Real Data 



In Section II-A we describe the PAF algorithm, and in Section II-B we evaluate its performance on some real 
datasets. 



A. The PAF algorithm 

Suppose Y is an $m \times n$ user-item matrix with entries in {0, 1, *}. The rows represent the users and the columns
represent the items. If the $(i, j)$th entry $Y(i, j)$ is 1 (or 0), then we interpret it as "user $i$ likes (or does not like)
the item $j$". A * indicates an unobserved rating. Upon observing Y, we want to recommend an item (a column)
to user 1. For rows $i$ and $j$, consider the number of entries that they agree on:

$$s_{ij} := \sum_{k=1}^{n} 1\{Y(i,k) \neq *\} \cdot 1\{Y(j,k) \neq *\} \cdot 1\{Y(i,k) = Y(j,k)\}, \qquad (1)$$

where $1\{\cdot\}$ denotes the indicator function. We use the following PAF algorithm to recommend an item $j_0$ to user 1.
PAF(T):

1) (Select the top T nearest rows) Compute $s_{1i}$ for $i = 1, 2, 3, \ldots, m$. Select the top T rows with
the highest values of similarity, where T is a parameter whose choice is discussed later.

2) (Pick the most popular column) Amongst the columns $j$ such that $Y(1, j) = *$, select the column
having the maximum number of 1's amongst the top T neighbors. Break ties randomly.

Suppose we represent each row by a vertex in a graph with an edge between vertices $i$ and $j$ iff $s_{ij} > 0$. Then to
recommend an item to user 1, the above algorithm depends only on the rows neighboring user 1, and chooses
the most popular item amongst the top few neighbors. Let $d$ denote the average degree of a vertex in this graph.
Then the complexity of Step 1 is $O(dm)$, and since $d$ is usually much smaller than $m$, the overall complexity of
Step 1 is low.

Fig. 2. Performance comparison of OptSpace with PAF for the MovieLens dataset (1,000,209 ratings), as the threshold used to quantize the
estimated values of OptSpace changes.

We note that several variants of the similarity metric are feasible, but as we show below, the PAF algorithm 
described above has competitive performance on real datasets, and is also amenable to analysis. 
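
The two steps of PAF(T) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used for the experiments: the erasure symbol * is encoded as -1, users are indexed from 0, ties in Step 1 are broken by row index (the text does not specify this), and the names `ERASED`, `similarity` and `paf` are hypothetical.

```python
import numpy as np

ERASED = -1  # assumed encoding of the '*' (unobserved) symbol


def similarity(Y, u):
    """Eq. (1): for every row i, count columns where rows u and i are both observed and agree."""
    observed = Y != ERASED
    both_observed = observed & observed[u]   # column observed by user u and by user i
    agree = Y == Y[u]                        # same symbol as user u in that column
    return np.sum(both_observed & agree, axis=1)


def paf(Y, T, u=0, rng=None):
    """Recommend one column to user u with PAF(T); returns None if user u has rated everything."""
    rng = np.random.default_rng() if rng is None else rng
    s = similarity(Y, u)
    # Step 1: the T rows with the highest similarity (stable sort, so ties go to the lower index).
    neighbors = np.argsort(-s, kind="stable")[:T]
    # Step 2: among columns unrated by user u, pick the one with the most 1's amongst the neighbors.
    candidates = np.flatnonzero(Y[u] == ERASED)
    if candidates.size == 0:
        return None
    ones = np.sum(Y[np.ix_(neighbors, candidates)] == 1, axis=0)
    best = candidates[ones == ones.max()]
    return int(rng.choice(best))             # ties between columns are broken at random
```

Note that row u is always amongst its own neighbors, but it contributes nothing in Step 2 because all candidate columns are erased in that row.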

B. Experimental results and discussion 

We consider the MovieLens data [19] (consisting of 1,000,209 ratings for 3952 movies made by 6040 users)
as well as a snapshot of the Netflix data [1] (consisting of 818,229 ratings for 4289 movies made by 7457 users,
obtained in year 2000). For both MovieLens and Netflix, the ratings are integers between 1 and 5. To apply the
PAF algorithm, we quantize the ratings: 4 and 5 are mapped to 1 ("recommended" movies), while 1, 2 and 3 are
mapped to 0 ("not recommended" movies). We split the ratings into train and test data as follows. For each user, we
randomly hide 30% of the ratings, and use these as the test data. We train our algorithms on the remaining data.
We can check correctness of a recommendation only if the rating of the recommended movie is hidden.

We compare the performance of the PAF algorithm with OptSpace (the algorithm proposed in [3]). OptSpace
uses ratings on the scale 1-5 as input and outputs real valued rating estimates. Since OptSpace outputs real values,
in order to compute the BER, we map the predicted ratings below 3.5 to 0, and the predicted ratings above 3.5 to
1. The BER is computed over the same set as for the PAF algorithm. Using a threshold of 3.5 is not necessarily
optimal. In Fig. 2 we see how the performance of OptSpace varies for the MovieLens dataset as we change the
threshold from 1 to 5. When the threshold is 0, OptSpace estimates all the entries as 1's, and its performance
exactly matches that of PAF. At the other extreme, when the threshold is 5, OptSpace estimates everything as 0's,
and its performance degrades. Because of the rating quantization scheme that we use (mapping {1, 2, 3} to 0, and
{4, 5} to 1), only a threshold between 3 and 4 makes sense. Since we do not see any significant improvement of
performance by optimizing over this threshold, we continue to use 3.5 as the threshold. Similar behavior is also
observed for the Netflix dataset. For both PAF and OptSpace, we have chosen the parameters that yield the best
performance on the test data.
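
The quantization and thresholding conventions above can be written down compactly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a prediction exactly equal to the threshold is mapped to 0 (the text only specifies "below" and "above"), and `bit_error_rate` is a generic fraction of mismatches over whatever set of entries is being evaluated.

```python
import numpy as np


def quantize_ratings(ratings):
    """Map 1-5 ratings to the binary alphabet: {4, 5} -> 1, {1, 2, 3} -> 0."""
    return (np.asarray(ratings) >= 4).astype(int)


def threshold_predictions(predicted, threshold=3.5):
    """Binarize real-valued rating estimates; values above the threshold become 1."""
    return (np.asarray(predicted, dtype=float) > threshold).astype(int)


def bit_error_rate(true_bits, estimated_bits):
    """Fraction of evaluated entries whose binary estimate differs from the true binary rating."""
    return float(np.mean(np.asarray(true_bits) != np.asarray(estimated_bits)))
```

For example, `bit_error_rate(quantize_ratings(hidden), threshold_predictions(estimates))` compares quantized hidden ratings with thresholded real-valued estimates, where `hidden` and `estimates` are placeholder arrays of equal length.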



TABLE I

Comparison of BER and RMSE of PAF with OptSpace

(a) Original MovieLens data (1,000,209 ratings)

          PAF(100)    OptSpace
BER       0.103       0.108
RMSE      0.748       0.733

(b) A snapshot of Netflix data (818,229 ratings)

          PAF(80)     OptSpace
BER       0.116       0.127
RMSE      0.942       0.742

(c) MovieLens data, after removing the popular movies (1,000,209 ratings)

          PAF(55)     OptSpace
BER       0.321       0.327
RMSE      1.010       0.901



Tables I(a) and I(b) show that in terms of BER, the PAF algorithm and OptSpace are close for both the MovieLens
and the Netflix data. We also compare both these methods in
terms of their root mean square error (RMSE). To compute the RMSE for PAF we map the binary estimates to
a scale of 1-5 as follows: a 0 is mapped to 2 (average of {1, 2, 3}), and a 1 is mapped to 4.5 (average of
{4, 5}). (Although this mapping is not necessarily optimal, we do not try to optimize it.) From the RMSE values
in Table I(a) and Table I(b) we see that for the MovieLens dataset both the algorithms are comparable and for
the snapshot of the Netflix dataset, OptSpace performs better than PAF in terms of RMSE. A comparison of this with
the BER comparison tells us that improvement in RMSE has little impact on BER, which is a reflection of the
poor confidence interval in the estimate. For this reason, we believe the binary alphabet and the BER metric are more
relevant for these datasets. This point is discussed further in Section III-A in the paragraph "Why binary?".



Fig. 3(a) shows how PAF(T) performs for different values of T for the MovieLens data. We see that the
BER is minimized around T = 100. We also note that for the snapshot of Netflix data we consider, the BER is
minimized at around T = 80. In Theorem 2 of Section III-B we show that the minimum BER is achieved at T = k
(the "true" cluster size), and hence the minimum in Fig. 3(a) is related to the degrees of freedom in the data.



If we use T = m, then we get the global popularity algorithm, and it has a BER of about 0.16 for the MovieLens
dataset. This indicates that the dataset has several movies which are popular amongst most users, and hence their
ratings are easy to predict. The true test of a collaborative filter is on datasets where a single row or column does
not reveal too much information about its missing entries. Since the PAF algorithm is biased towards globally popular
movies, to test its performance further, for the MovieLens dataset we remove all movies with more than 60% of their
ratings equal to 1. Even for this "filtered" dataset, we see from Table I(c) that the PAF algorithm and OptSpace are comparable.
Fig. 3(b) shows that the minimum BER is achieved when T is around 55.

Remark 1: Most of the computational time of PAF is spent in finding the row correlations.
As the data evolves with time, in the sense that new users/movies enter the data or users rate more existing
movies, the row correlations can be updated efficiently, since usually only a few of the row correlations are
affected at a time.

Fig. 3. Bit error rate of PAF for different values of T, for the MovieLens data with 1,000,209 ratings. While (a) compares the BER for the
original MovieLens data, (b) compares the BER for the MovieLens data after filtering out the popular movies with more than 60% of their
ratings as 1's.

In summary, the PAF algorithm yields competitive performance on real data, even though it uses only quantized
ratings (as opposed to the 1-5 scale used by OptSpace). To explain the competitive performance of the PAF algorithm, in the
following section, we analyze its performance for a binary matrix model introduced in [1], [2].

III. Analysis of the PAF Algorithm

In Section III-A we describe our mathematical model (first introduced in [1], [2]) and in Section III-B we state
and discuss our main results. But before we begin analyzing PAF, we set up some notation.
Notation: By $X \sim B(n, p)$ we mean that a random variable X is binomially distributed with parameters n and p.
For two real valued functions $f(n)$ and $g(n)$, if there exist strictly positive M and $n_0$ such that $|f(n)| \le M|g(n)|$
for all $n > n_0$, then we denote $f(n) = O(g(n))$ and $g(n) = \Omega(f(n))$. If $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$ then
we say $f(n) = \Theta(g(n))$. We say $f(n) = o(g(n))$ if $\lim_{n\to\infty} f(n)/g(n) = 0$, and $f(n) \sim g(n)$ if $\lim_{n\to\infty} f(n)/g(n) = 1$. For
a sequence of real valued functions $\{f_i(n)\}_{i \in I}$ and $g(n)$, if there exist strictly positive M and $n_0$ (both independent
of i) such that for $i \in I$ and for $n > n_0$ we have $|f_i(n)| \le M|g(n)|$, then we denote $\{f_i(n)\}_{i \in I} = O(g(n))$. Other
order notations for sequences of functions are defined in a similar manner. For a matrix X, $X(:, j)$ denotes the jth
column of X. For a vector y with entries in {0, 1, *}, where * denotes an erasure, $|y|_0$, $|y|_1$ and $|y|$ represent the number of 0's,
the number of 1's and the total number of 0's and 1's respectively. For a sequence of events $\{A_n\}$, if $\Pr[A_n] \to 1$ with
n, then we say that $A_n$ occurs w.h.p. For parameters that depend on the data size n (e.g., $\epsilon$, k, etc.), we do not
show this dependence explicitly unless it is not clear from the context.

Fig. 4. Summary of the data model. (Block diagram: matrix with unknown row and column clusters, then BSC($p$), giving the matrix with errors, then Erasure($\epsilon$), giving the observed matrix with errors and erasures.)



A. The Data Model 

We consider an $n \times n$ matrix X whose entries are binary. The rows of the matrix represent users and the columns
represent items. Suppose $\mathcal{A} = \{A_i\}_{i=1}^{r}$ and $\mathcal{B} = \{B_j\}_{j=1}^{r}$ are two partitions of [1 : n], representing sets of similar
users and items. We call the sets $A_i \times B_j$ clusters, and call the $A_i$'s ($B_j$'s) the row (column) clusters. We assume that
for all $i = 1, 2, \ldots, r$, we have $|A_i| = |B_i| = k$. The matrix X is constant over the cluster $A_i \times B_j$ and the entries are
i.i.d. Bernoulli(1/2)$^1$ across the clusters. Formally, if $(p, q) \in A_i \times B_j$, then $X(p, q) = x_{ij}$, where $\{x_{ij}\}_{i,j=1}^{r}$ are
i.i.d. Bernoulli(1/2). The observed matrix Y is obtained by passing the entries of X independently through a binary
symmetric channel (BSC) (defined below) with parameter $p$, and then through a binary erasure channel (defined
below) with erasure probability $\epsilon$. The entries of the observed matrix Y are from {0, 1, *}, where * denotes an
erased entry. Fig. 4 summarizes our data model.

The BSC is a binary input, binary output channel that makes an error with probability $p$ ([20]). In our case,
it models noisy behavior of users. In the binary erasure channel, every bit is erased with probability $\epsilon$, and the
receiver knows which bits have been erased ([20]). The erasure channel models the missing entries in the rating
matrix.
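
To make the model concrete, the sketch below draws one pair (X, Y) from it. It is an illustration under stated assumptions: erasures are encoded as -1, n is required to be a multiple of k, the unknown row and column clusters are produced by random permutations of the cluster labels, and the function name `sample_model` is hypothetical.

```python
import numpy as np

ERASED = -1  # assumed encoding of the '*' symbol


def sample_model(n, k, p, eps, rng=None):
    """Draw (X, Y, row_cluster, col_cluster) from the block constant + BSC(p) + Erasure(eps) model."""
    rng = np.random.default_rng() if rng is None else rng
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    r = n // k                                    # number of row (and column) clusters
    x = rng.integers(0, 2, size=(r, r))           # i.i.d. Bernoulli(1/2) value per cluster
    row_cluster = rng.permutation(np.repeat(np.arange(r), k))  # hidden cluster label of each row
    col_cluster = rng.permutation(np.repeat(np.arange(r), k))  # hidden cluster label of each column
    X = x[np.ix_(row_cluster, col_cluster)]       # block constant up to row/column permutations
    flips = rng.random((n, n)) < p                # BSC: each entry flipped independently w.p. p
    noisy = np.where(flips, 1 - X, X)
    erased = rng.random((n, n)) < eps             # erasure channel: each entry erased w.p. eps
    Y = np.where(erased, ERASED, noisy)
    return X, Y, row_cluster, col_cluster
```

In the asymptotic setting of the paper one would take `eps = 1 - c / n**alpha`; here `eps` is passed directly to keep the sketch independent of that parametrization.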

Why binary?: We consider the case of binary entries for simplicity, and as in [1], this can be relaxed to allow
any finite alphabet. The choice of the binary alphabet not only leads to a simpler description of the main ideas, but
as explained below, it is also a case of practical interest.

• For datasets such as Netflix, even the best known methods have a root mean square error (RMSE) of 0.8567
[16], which on a scale of 1-5 elicits poor confidence in the estimate. This is because, even in the absence
of variance (i.e., when all the contribution to RMSE comes from the bias), the confidence interval for such
an estimate is ±0.8567, which shows poor confidence on a scale of 1-5. However, the task of determining
whether a movie is liked (say rating ≥ 4) or not can be done with more reliability, suggesting the importance
of the binary alphabet in what appears to be very noisy data. (In fact, in Section II-B we saw that the PAF
algorithm uses quantized inputs on the binary scale (instead of 1-5) but still yields competitive performance
compared to OptSpace, which uses the unquantized inputs.)
• In many datasets, users tend to rate items either very high or very low. For example, this was observed in a
recent study by YouTube [21], [22], which prompted the switch to a binary rating scale instead of 1-5.

$^1$A random variable X is called Bernoulli($p$) if $\Pr[X = 1] = p$ and $\Pr[X = 0] = 1 - p$.



July 18, 2011 



DRAFT 






We also note that all our results can be extended to the case when X is $m \times n$ and the clusters are nonuniform,
provided $m = \Theta(n)$ and all the cluster sizes are of the same order. Since the non-uniform case does not offer any
additional insights, in this paper we have chosen to use the uniform case, which leads to substantially simpler
notation.



B. Main Results and Discussion 

Upon observing Y, suppose PAF(T) recommends a column $j_{\max}$. The probability of error for this recommendation is

$$P_e[\mathrm{PAF}(T)] = \Pr[X(1, j_{\max}) = 0].$$

Here we study how the PAF algorithm performs for the matrix model discussed above, and identify three different
performance regimes based on the erasure rate and the cluster size. In the following, we assume that the erasure
probability $\epsilon = 1 - c/n^{\alpha}$ for some $c > 0$ and $\alpha \ge 0$, and assume that the true cluster size $k$ is known. The value of $\alpha$
determines the rate at which the erasure probability approaches unity as n grows. We have the following theorems.

1) Low Erasure Rate, Large Cluster Size: This regime is illustrated by Phase I of Fig. 1 and the main result
is as follows. Recall that without loss of generality, we recommend an item to user 1.

Theorem 1 ($\alpha < 1/2$, large cluster size): Assume that $\alpha \in (0, 1/2)$, and the BSC error probability $p \in [0, 1/2)$.
Suppose there exists a sequence $\gamma_n \ge 0$ such that $\gamma_n \to 0$ and $k \ge n^{\alpha - \gamma_n}$. Then the following are true.

a) If $k = o(n)$, then $P_e[\mathrm{PAF}(k)] \to 0$.

b) If $k = \Theta(n)$, then $P_e[\mathrm{PAF}(k) \mid \text{not all entries of the 1st row of X are 0's}] \to 0$.
For $\alpha = 0$, the error probability goes to zero as long as $k$ increases to infinity with n.



This result is proved in Section IV, but next we describe the main intuition behind the result. When $\alpha < 1/2$,
there are enough samples to distinguish the neighbors from $A_1$ ("good" neighbors) from the neighbors outside $A_1$
("bad" neighbors). In fact, all the top k neighbors selected by the PAF algorithm are good with high probability.
Moreover, when $\gamma_n \to 0$, we show that the most popular column has an overwhelming number of 1's compared to 0's.
We then show that this cannot happen unless the true rating of the most popular column is 1 with high probability
(w.h.p.).

Remark 2: When r is bounded (i.e., $k = \Theta(n)$), we need the assumption that not all entries in the 1st row of
X are 0's, because there is a nonzero probability that all entries of the 1st row of X are 0's. In this case we will
always make a wrong recommendation.

Remark 3: It is also of interest to know the rate at which the error probability goes to zero. The convergence
rate crucially depends on $\gamma_n$ in a non-trivial manner and we are unable to find a clean bound. However, for $\gamma_n = 0$
we can find a bound on the error probability, and we have Pe[PAF(fc)] — O ^l/c^'°^ for some Ci > 1. We also 
note that this bound is not tight in general. 



2) Low Erasure Rate, Small Cluster Size: From the empirical results in Section II-B we see that 0 < BER < 1/2.
If we assume that our asymptotic model is applicable to the data size considered, then the regime of Theorem 1
does not seem to capture this. Theorem 2 stated below identifies a regime where the asymptotic BER of the PAF
algorithm is bounded away from both 0 and 1/2. (Phase II of Fig. 1 illustrates this regime.)

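Theorem 2 ($\alpha < 1/2$, small cluster size): Assume that $\alpha \in (0, 1/2)$, and the BSC error probability $p \in [0, 1/2)$.
Suppose there is a constant $\gamma \in (0, \alpha]$ and $g_n = o(1)$ such that the cluster size $k = n^{\alpha - \gamma + g_n}$. Then the limit
$\lim_{n\to\infty} P_e[\mathrm{PAF}(k)]$ exists, and we have the following.

• If $1/\gamma$ is not an integer, then

$$\lim_{n\to\infty} P_e[\mathrm{PAF}(k)] = \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}}.$$

• If $1/\gamma$ is an integer, then

$$\frac{p^{1/\gamma}}{p^{1/\gamma} + (1-p)^{1/\gamma}} \le \lim_{n\to\infty} P_e[\mathrm{PAF}(k)] \le \frac{p^{1/\gamma - 1}}{p^{1/\gamma - 1} + (1-p)^{1/\gamma - 1}}.$$

Moreover, T = k is optimal in the sense that, for all T,

$$\lim_{n\to\infty} P_e[\mathrm{PAF}(k)] \le \liminf_{n\to\infty} P_e[\mathrm{PAF}(T)].$$

We prove this theorem in Section V, but below we provide some intuition.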

As in Theorem 1, when $\alpha < 1/2$, for T = k most neighbors picked are good with high probability. However,
since $\gamma > 0$, the number of 1's for the most popular movie is concentrated on $\lfloor 1/\gamma \rfloor$ when $1/\gamma$ is not an integer
(and is concentrated on $\{1/\gamma - 1, 1/\gamma\}$ when $1/\gamma$ is an integer), which is finite. Thus, even though the algorithm
picks the good neighbors, it fails to average out the noise in the ratings completely, leading to a BER bounded
away from 0.

Furthermore, Theorem 2 states that in the limit as $n \to \infty$, T = k is optimal. This is expected since for T < k
we do not use the full set of good neighbors, and for T > k, we pick bad neighbors. As T approaches n, the PAF
algorithm approaches the global popularity algorithm, and for our mathematical model, its BER is 1/2. We note
that for the MovieLens dataset with popular movies removed, Fig. 3 suggests an optimal value of T = 55, which
is a reflection of the user cluster size.
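
As a quick numerical illustration of the limiting expression in Theorem 2, the helper below evaluates $p^{j}/(p^{j} + (1-p)^{j})$ at $j = \lfloor 1/\gamma \rfloor$, and returns the pair of bounds when $1/\gamma$ is an integer. This is a sketch, not part of the analysis: the function names are hypothetical, and the floating-point tolerance used to decide whether $1/\gamma$ is an integer is an assumption.

```python
import math


def error_expression(p, j):
    """The Theorem 2 expression p^j / (p^j + (1 - p)^j) for a positive integer j."""
    return p**j / (p**j + (1 - p)**j)


def paf_limiting_ber(p, gamma, tol=1e-9):
    """Return (lower, upper) bounds on the limiting BER of PAF(k); equal unless 1/gamma is an integer."""
    inv = 1.0 / gamma
    nearest = round(inv)
    if abs(inv - nearest) < tol:                  # 1/gamma is (numerically) an integer
        return error_expression(p, nearest), error_expression(p, nearest - 1)
    value = error_expression(p, math.floor(inv))
    return value, value
```

For instance, with p = 0.2 and $\gamma = 0.4$ we have $\lfloor 1/\gamma \rfloor = 2$, so the limit is $0.04/(0.04 + 0.64) = 1/17 \approx 0.059$; with $\gamma = 0.5$ the bounds are $1/17$ and $0.2$.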

3) High Erasure Rate: The above two theorems discuss the case when $\alpha < 1/2$. In this case, w.h.p. the PAF
algorithm can filter out the bad neighbors. But when $\alpha > 1/2$, there are too few samples to distinguish the good
neighbors from the bad ones. In fact, amongst the top T neighbors, only a vanishingly small fraction are good
neighbors. This forces the BER to approach 1/2, and is stated in Theorem 3 below, which is proved in Section VI.

Theorem 3 ($\alpha > 1/2$): Assume that $\alpha > 1/2$, the BSC error probability $p = 0$, and $k = o(n)$. Then for all T,

$$P_e[\mathrm{PAF}(T)] \to 1/2.$$

In the regime of Theorem 3, the errors occur mainly due to the fact that the PAF algorithm cannot identify the
good neighbors. Some side information about the similarity amongst users (for example information about social
connections, locations, etc.) would help the algorithm to find the good neighbors. In Fig. 1, Phase III represents
this high erasure rate regime.

Remark 4: In the regime of Theorem 3, i.e., for $\alpha > 1/2$, we need that $r \to \infty$ to prove that the BER goes to 1/2.
If r stays bounded, then we believe that the BER would be bounded away from 1/2 and 0. But we are unable to
prove this yet.

A numerical example: Given the above three theorems describing the asymptotic performance of PAF under
various regimes, it is of interest to understand if such asymptotics are valid for finite data size. To answer this, we
simulate datasets using our data model with n = 1000, k = 10, p = 0.2, and with varying $\alpha$. Fig. 5 shows that
even for this small dataset, the asymptotic theory matches well with the simulation for $\alpha < 1/3$ and $\alpha > 1/2$. Since
$k = n^{1/3}$, $\alpha < 1/3$ represents the regime of Theorem 1. Similarly, $\alpha > 1/2$ represents the regime of Theorem 3.
In the regime of Theorem 2 (i.e., for $1/3 < \alpha < 1/2$), there is a gap between the asymptote and the simulation, and
we need to consider a larger dataset to reduce this gap.
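
The mapping from $(n, k, \alpha)$ to a regime used in this example can be sketched as follows. This is a finite-n heuristic rather than a statement of the theorems: it treats $\log k / \log n$ as the cluster-size exponent, reads off $\gamma = \alpha - \log k / \log n$, and leaves the boundary $\alpha = 1/2$ unclassified because the theorems above do not cover it. The function name `paf_regime` is hypothetical.

```python
import math


def paf_regime(n, k, alpha):
    """Return (regime, gamma), with regime in {"I", "II", "III", "boundary"}."""
    if alpha > 0.5:
        return ("III", None)            # Theorem 3: too few samples to find good neighbors
    if alpha == 0.5:
        return ("boundary", None)       # not covered by Theorems 1-3
    beta = math.log(k) / math.log(n)    # finite-n proxy for the exponent in k = n^beta
    gamma = alpha - beta
    if gamma <= 0:
        return ("I", gamma)             # Theorem 1: k at least of order n^alpha
    return ("II", gamma)                # Theorem 2: k = n^(alpha - gamma) with gamma > 0
```

With n = 1000 and k = 10 the exponent is 1/3, so `paf_regime(1000, 10, 0.2)` falls in regime I, `paf_regime(1000, 10, 0.4)` in regime II with $\gamma \approx 0.067$, and `paf_regime(1000, 10, 0.6)` in regime III.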

4) Suboptimality of PAF: Having seen the performance of PAF in the above theorems, from a mathematical
perspective it is natural to ask if PAF is optimal for the above data model. Let $P_e(n)$ denote the error probability
of a given recommender, parametrized by the matrix size n.

Theorem 4: Suppose the BSC error probability $p \in [0, 1/2)$.

• Converse: If $k^2 \le n^{\alpha - \gamma + g_n}$ for $\gamma \in [0, \min(\alpha, 1)]$ and $g_n = o(1)$, then for any recommender

$$\liminf_{n\to\infty} P_e(n) \ge \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}}.$$

• Achievability: Assume $\alpha \in (0, 1/2)$, and suppose that $k = o(n)$. If there exists $\gamma_n = o(1)$ such that
$k^2 \ge n^{\alpha - \gamma_n}$, then there exists an algorithm (described in the proof) such that

$$P_e(n) \to 0.$$

Moreover, if $k^2 = n^{\alpha - \gamma + g_n}$ for $\gamma \in (0, \min(\alpha, 1)]$ and $g_n = o(1)$, then

$$\frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} \le \lim_{n\to\infty} P_e(n) \le \frac{p^{\lceil 1/\gamma \rceil - 1}}{p^{\lceil 1/\gamma \rceil - 1} + (1-p)^{\lceil 1/\gamma \rceil - 1}}.$$

Remark 5: We note that the lower and upper bounds in the final expression of Theorem 4 are identical, unless
$1/\gamma$ is an integer.

The lower bound in the converse is obtained by using an oracle, which tells us the true clusters ($\mathcal{A}$ and $\mathcal{B}$), and then
using techniques similar to the ones used in proving Theorem 2. The achievability proof uses the fact that for $\alpha < 1/2$ and
$r > c_1 \log n$, w.h.p. we can cluster the matrix correctly. Then the result for $k^2 \ge n^{\alpha - \gamma_n}$ follows from arguments
similar to those used in proving Theorem 1, and the result for $k^2 = n^{\alpha - \gamma + g_n}$ is obtained by using arguments
similar to those used in proving Theorem 2. A more detailed proof is presented in Section VII.

Comparing Theorem 1 and Theorem 2 with Theorem 4, we see that PAF is suboptimal. But PAF is computationally
faster than the algorithm that achieves the bounds in Theorem 4 (described in the proof), since it does not require
any explicit clustering of the rows and the columns. This is one of the main reasons why we consider PAF in
this paper (instead of the clustering based algorithm in Q or in the proof of Theorem 4). In Section II-B we have
already seen the competitive performance of PAF on real world datasets, which makes PAF even more appealing.
In the following, we present the proofs of these four theorems.

IV. Proof of Theorem 1

The PAF algorithm has two steps. First we find the neighbors, and then we recommend using the popularity 
amongst the neighbors. We analyze the errors in these steps separately. 

A. Analysis of Step 1 of the Algorithm 

We show that for a < 1/2, w.h.p. the top k rows are all from the cluster of user 1, namely Ai |^ First we obtain 
the following two lemmas that will help us in proving this. Recall that p denotes the error probability of the BSC. 
Lemma 1 (Overlap with rows within cluster): For any 5 E (0, 1), we have w.h.p. for all i in Ai, 

Su>n'-"'iil-pf+p'){l-S). 

Proof: We see that sn ~ B{n, 1 — e) and for i e Ai\{l}, su ^ B{n, (1 — e)^((l — p)^ In other words, 

for i e Ai\{l}, is a Binomial random variable with E[sii] = n(l — e)^((l — p)^ = cn^^^"((l — p)^ 
and sii is a Binomial random variable with E[sii] = n{l — e) = cn^^". The lemma is now a direct consequence 
of the Chernoff bound ||231 Theorem 1.1], together with the union bound. ■ 
Lemma 2 (Overlap with rows outside cluster): There is a constant ci G (0, 1), such that for any 5 G (0, 1), we 
have w.h.p. for all i outside Ai, 

su < ni-2"[(l +p2 _ ^^(1 _ 2p)2](l + S). 

Proof: The proof is given in Appendix ]A] ■ 

^1 If for a row cluster $A_i$, $X_{A_i}$ ($X$ restricted to the rows of $A_i$) is identical to $X_{A_1}$, then for all practical purposes we can include all the
rows of $A_i$ in $A_1$ itself. Throughout this proof, we assume that the rows from all the clusters identical to $A_1$ have already been included in
$A_1$. Thus for $i \notin A_1$, the $i$th row and the 1st row differ in at least one column cluster.
Since $p < 1/2$, the lower bound of Lemma 1 is greater than the upper bound of Lemma 2 for a sufficiently small
value of $\delta$. Thus, w.h.p. we have $\min_{i \in A_1} s_{1i} > \max_{j \notin A_1} s_{1j}$, i.e., all the top $k$ rows chosen by PAF($k$) are from
$A_1$. In other words, if $E_{1,n}$ denotes the event that there is an error in Step 1 of the algorithm, i.e., PAF($k$) chooses
some rows from outside $A_1$, then

$$\Pr[E_{1,n}] \to 0, \quad \text{as } n \to \infty. \tag{2}$$
Remark 6: The above two lemmas are valid for both case a) with $k = o(n)$ and case b) with $k = \Omega(n)$.
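
Step 1 can be illustrated with a minimal sketch. It assumes that the similarity of two rows is the number of columns in which both ratings are observed and agree (as used in Appendix A), that erased entries are stored as the string "*", and that the user's own row is excluded from the neighbor list. The names `similarity` and `top_neighbors` and the tie-breaking rule are hypothetical choices made for this illustration only.

```python
ERASED = "*"  # marker for an unobserved rating (an assumption of this sketch)


def similarity(row_a, row_b):
    """Count columns where both rows are observed and carry the same rating."""
    return sum(
        1
        for a, b in zip(row_a, row_b)
        if a != ERASED and b != ERASED and a == b
    )


def top_neighbors(ratings, user, k):
    """Indices of the k rows most similar to `user` (ties: lower index first)."""
    scores = [
        (similarity(ratings[user], row), idx)
        for idx, row in enumerate(ratings)
        if idx != user
    ]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:k]]


ratings = [
    [1, "*", 0, 1, "*", 0],   # user 0, playing the role of the first row
    [1, 1, 0, "*", 1, 0],     # same cluster as user 0
    [1, "*", 0, 1, 0, "*"],   # same cluster as user 0
    [0, 0, 1, 0, "*", 1],     # a different cluster
    ["*", 0, 1, 0, 1, 1],     # a different cluster
]
neighbors = top_neighbors(ratings, user=0, k=2)
print(neighbors)
```

In this toy matrix the two rows that share the cluster of user 0 overlap with it in three agreeing entries each, while the rows from other clusters do not agree anywhere, which is the separation that Lemma 1 and Lemma 2 establish w.h.p. for large $n$.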

B. Analysis of Step 2 of the Algorithm 

Suppose $\alpha \in (0, 1/2)$. First we condition on the event that Step 1 does not make an error (i.e., the event $E_{1,n}^c$).
Let $S$ denote the set of column indices $j$ such that $X(1,j) = 1$ and $Y(1,j) = *$, and suppose $X_k$ and $Y_k$ denote the
sub-matrices of $X$ and $Y$ respectively, consisting of the top $k$ neighbors. Also let $j_{max}$ denote the most popular
column chosen by PAF($k$), i.e., $j_{max} := \arg\max_{j : Y(1,j) = *} |Y_k(:,j)|_1$. The statistics of the columns in $S$ are independent
of the event $E_{1,n}$. Thus, conditioned on $E_{1,n}^c$, for $j \in S$, we have $|Y_k(:,j)|_1 \sim B(k, (1-\epsilon)(1-p))$. Define
$\mu_Y := E[|Y_k(:,j)|_1]$ and $\sigma_Y^2 := \mathrm{Var}(|Y_k(:,j)|_1)$. We note that because of the i.i.d. nature of the columns of $Y$,
the mean $\mu_Y$ and the variance $\sigma_Y^2$ do not depend on $j$. We have the following lemma.

Lemma 3 (1's form overwhelming majority in the most popular column): Let $j_{max}$ be the most popular column.
Under both case a) and b), there exists a sequence of positive reals $\{c_n\}$, such that $c_n \to \infty$ with $n$, and w.h.p.

$$|Y_k(:,j_{max})|_1 - |Y_k(:,j_{max})|_0 > c_n.$$

Proof: The proof is given in Appendix B. ■
Now we use Lemma 3 to prove that PAF makes a vanishingly small probability of error. Suppose

$$\mathcal{M}_n := \{y \in \{0,1,*\}^k : (|y|_1 - |y|_0) > c_n\},$$

where the $c_n$'s are as in Lemma 3. We also observe that for a column $j$,

$$X_k(:,j) \to Y_k(:,j) \to \{j_{max} = j\}, \tag{3}$$

i.e., the random variables $\{X_k(:,j), Y_k(:,j), \{j_{max} = j\}\}$ form a Markov chain. We are interested in finding the
overall probability of error. Due to the i.i.d. nature of the data model, all the columns of $X$ have the same distribution.
Thus we have

$$\begin{aligned}
P_e[\mathrm{PAF}(k)] &= \sum_{j=1}^{n} \Pr[j_{max} = j] \cdot \Pr[X(1,j) = 0 \mid j_{max} = j] \\
&= \Pr[X(1,1) = 0 \mid j_{max} = 1]. \qquad (4)
\end{aligned}$$
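From here on, we analyze the error probability conditioned on the event that $j_{max} = 1$. In the following, by $p_{k,j}(y)$ we
mean $\Pr[Y_k(:,j) = y \mid j_{max} = j, E_{1,n}^c]$. Thus

$$\begin{aligned}
P_e[\mathrm{PAF}(k)] &= \Pr[X(1,1) = 0 \mid j_{max} = 1] \\
&\overset{(a)}{=} \Pr[X(1,1) = 0 \mid j_{max} = 1, E_{1,n}^c] + o(1) \\
&= \sum_{y \in \{0,1,*\}^k} \Pr[X(1,1) = 0, Y_k(:,1) = y \mid j_{max} = 1, E_{1,n}^c] + o(1) \\
&\overset{(b)}{=} \sum_{y \in \mathcal{M}_n} \Pr[X(1,1) = 0, Y_k(:,1) = y \mid j_{max} = 1, E_{1,n}^c] + o(1) \\
&\overset{(c)}{=} \sum_{y \in \mathcal{M}_n} \Pr[X(1,1) = 0 \mid Y_k(:,1) = y, E_{1,n}^c] \cdot p_{k,1}(y) + o(1) \\
&\overset{(d)}{=} \sum_{y \in \mathcal{M}_n} \frac{\Pr[Y_k(:,1) = y \mid X(1,1) = 0, E_{1,n}^c]}{\Pr[Y_k(:,1) = y \mid X(1,1) = 0, E_{1,n}^c] + \Pr[Y_k(:,1) = y \mid X(1,1) = 1, E_{1,n}^c]} \, p_{k,1}(y) + o(1) \\
&= \sum_{y \in \mathcal{M}_n} \frac{p^{|y|_1}(1-p)^{|y|_0}}{p^{|y|_1}(1-p)^{|y|_0} + (1-p)^{|y|_1} p^{|y|_0}} \, p_{k,1}(y) + o(1) \\
&= \sum_{y \in \mathcal{M}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} \, p_{k,1}(y) + o(1) \\
&\le \max_{y \in \mathcal{M}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} + o(1) \\
&\overset{(e)}{=} o(1), \qquad (5)
\end{aligned}$$

where (a) follows from (2), (b) is true because Lemma 3 says that $\mathcal{M}_n$ happens w.h.p., (c) is due to the Markov
property (3) and the notation of $p_{k,j}(y)$, (d) is the Bayes' expansion (with $X(1,1)$ equally likely to be 0 or 1), and (e) is true since for $y \in \mathcal{M}_n$, $|y|_1 - |y|_0 > c_n$,
and the fact that for $p < 1/2$, $\frac{p^x}{p^x + (1-p)^x} \to 0$ as $x \to \infty$. This proves that $P_e[\mathrm{PAF}(k)] \to 0$.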
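
The role of the ratio appearing in (5) can be checked numerically. The sketch below evaluates $p^x / (p^x + (1-p)^x)$ for a few values of the margin $x = |y|_1 - |y|_0$. The function name `posterior_error` and the value $p = 0.2$ are illustrative assumptions, not part of the paper.

```python
def posterior_error(p, x):
    """p^x / (p^x + (1 - p)^x): posterior probability that the true rating is 0
    when 1's outnumber 0's by x among the observed entries (BSC error rate p)."""
    return p**x / (p**x + (1 - p)**x)


margins = (0, 1, 2, 5, 10, 20)
values = [posterior_error(0.2, x) for x in margins]
for x, v in zip(margins, values):
    print(f"margin {x:2d}: posterior error {v:.3e}")
```

For $p < 1/2$ the value starts at $1/2$ for a zero margin and decreases monotonically towards 0 as the margin grows, which is exactly what step (e) uses.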

When $\alpha = 0$ and $k$ increases to infinity with $n$, by following a similar line of argument as above, we see that
there are increasingly many 1's in the most popular column, and 1's also form a majority in that column; thus the
error probability approaches 0. We omit the details here.

V. Proof of Theorem 2

The analysis for Step 1 of the algorithm is exactly the same as in Section IV-A. Here we analyze Step 2 of
PAF($k$), conditioned on the event that all the top $k$ neighbors are good (the event $E_{1,n}^c$).

Recall that $k = n^{\alpha - \gamma}$. We show that in this case the most popular column of $Y_k$ (the top $k$ rows of $Y$) has
a finite number of unerased entries. This allows us to find a lower bound on the probability of error. Let $H$ denote
the set of column indices such that $Y(1,j) = *$, i.e., the columns where entries of the first row are "hidden".

Lemma 4 (Finite number of unerased entries): W.h.p.

$$\max_{j \in H} |Y_k(:,j)| \le \lfloor 1/\gamma \rfloor.$$

Proof: The proof is given in Appendix C. ■
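A. When $1/\gamma$ is Not an Integer

As in the previous section, let $j_{max}$ be the column that is recommended by the PAF algorithm. Due to Lemma
4 we have $|Y_k(:,j_{max})|_1 \le \lfloor 1/\gamma \rfloor$ w.h.p. When $1/\gamma$ is not an integer, the following lemma says that w.h.p. it is
in fact equal to $\lfloor 1/\gamma \rfloor$, i.e., in the most popular column, all the observed entries are 1's.

Lemma 5: If $1/\gamma$ is not an integer, then w.h.p.

$$|Y_k(:,j_{max})|_1 = \lfloor 1/\gamma \rfloor, \quad \text{and} \quad |Y_k(:,j_{max})|_0 = 0.$$

Proof: The proof is given in Appendix D. ■
Suppose

$$\mathcal{I}_n := \{y \in \{0,1,*\}^k : |y|_1 = \lfloor 1/\gamma \rfloor, \text{ and } |y|_0 = 0\}.$$

Lemma 5 says that $\mathcal{I}_n$ happens with high probability. We want to find the limiting behavior of the total probability
of error. By following the steps as in (5) and replacing the event $\mathcal{M}_n$ by the event $\mathcal{I}_n$ (this replacement is justified
due to Lemma 5), we have

$$\begin{aligned}
P_e[\mathrm{PAF}(k)] &= \sum_{y \in \mathcal{I}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} \, p_{k,1}(y) + o(1) \\
&\overset{(a)}{=} \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} \sum_{y \in \mathcal{I}_n} p_{k,1}(y) + o(1) \\
&\overset{(b)}{=} \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} (1 - o(1)) + o(1) \\
&= \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} + o(1),
\end{aligned}$$

where (a) is true due to the definition of the set $\mathcal{I}_n$, and (b) is true because of Lemma 5 ($\mathcal{I}_n$ happens w.h.p.). Thus
we have proved that

$$\lim_{n \to \infty} P_e[\mathrm{PAF}(k)] = \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}}.$$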



B. When $1/\gamma$ is an Integer

We first prove the lower bound on the probability of error. Due to Lemma 4 we have that $|Y_k(:,j_{max})| \le 1/\gamma$
w.h.p. Define

$$\mathcal{J}_n := \{y \in \{0,1,*\}^k : |y| \le 1/\gamma\}.$$
Thus Lemma 4 says that $\mathcal{J}_n$ happens with high probability. By following the steps as in (5) and replacing the event
$\mathcal{M}_n$ by the event $\mathcal{J}_n$ (this replacement is justified due to Lemma 4), we have

$$\begin{aligned}
P_e[\mathrm{PAF}(k)] &= \sum_{y \in \mathcal{J}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} \, p_{k,1}(y) + o(1) \\
&\overset{(a)}{\ge} \min_{y \in \mathcal{J}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} \, (1 - o(1)) + o(1) \\
&\overset{(b)}{\ge} \frac{p^{1/\gamma}}{p^{1/\gamma} + (1-p)^{1/\gamma}} \, (1 - o(1)) + o(1),
\end{aligned}$$

where (a) is true because of Lemma 4, and (b) is true since $|y|_1 - |y|_0 \le |y| \le 1/\gamma$ for $y \in \mathcal{J}_n$, and for $x \in \mathbb{R}$,
$\frac{p^x}{p^x + (1-p)^x}$ is a decreasing function of $x$ for $p < 1/2$. Thus we have

$$\liminf_{n \to \infty} P_e[\mathrm{PAF}(k)] \ge \frac{p^{1/\gamma}}{p^{1/\gamma} + (1-p)^{1/\gamma}}, \tag{6}$$

which proves the lower bound. To prove the upper bound, we need the following lemma.
Lemma 6: If $1/\gamma$ is an integer, we have w.h.p.

$$|Y_k(:,j_{max})|_1 - |Y_k(:,j_{max})|_0 \ge 1/\gamma - 1.$$

Proof: The proof is given in Appendix E. ■
Define

$$\mathcal{K}_n := \{y \in \{0,1,*\}^k : |y|_1 - |y|_0 \ge 1/\gamma - 1\}.$$

The above lemma says that $\mathcal{K}_n$ occurs with high probability. Then following the steps as in (5), and due to Lemma
6, we have

$$\begin{aligned}
P_e[\mathrm{PAF}(k)] &= \sum_{y \in \mathcal{K}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} \, p_{k,1}(y) + o(1) \\
&\le \max_{y \in \mathcal{K}_n} \frac{p^{|y|_1 - |y|_0}}{p^{|y|_1 - |y|_0} + (1-p)^{|y|_1 - |y|_0}} + o(1) \\
&\overset{(a)}{\le} \frac{p^{1/\gamma - 1}}{p^{1/\gamma - 1} + (1-p)^{1/\gamma - 1}} + o(1),
\end{aligned}$$

where (a) is true because of Lemma 6, the definition of $\mathcal{K}_n$, and the observation that for $x \in \mathbb{R}$, $\frac{p^x}{p^x + (1-p)^x}$ is a
decreasing function of $x$ for $p < 1/2$. Thus we have

$$\limsup_{n \to \infty} P_e[\mathrm{PAF}(k)] \le \frac{p^{1/\gamma - 1}}{p^{1/\gamma - 1} + (1-p)^{1/\gamma - 1}},$$

and this together with (6) proves the theorem when $1/\gamma$ is an integer.
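
The limiting expressions of Sections V-A and V-B can be evaluated with a short sketch. It uses exact rational arithmetic so that the test "is $1/\gamma$ an integer?" is not affected by floating point rounding. The function names are hypothetical, and $p$ and $\gamma$ are assumed to be passed as `Fraction` objects (or strings such as "0.4").

```python
from fractions import Fraction
from math import floor


def ratio(p, x):
    """p^x / (p^x + (1 - p)^x) for an integer exponent x."""
    return p**x / (p**x + (1 - p)**x)


def limiting_ber_bounds(p, gamma):
    """Return (lower, upper) bounds on the limiting BER of PAF(k), k = n^(alpha - gamma).

    When 1/gamma is not an integer the two bounds coincide (Section V-A);
    otherwise they are the bounds of Section V-B.
    """
    p = Fraction(p)
    inv = 1 / Fraction(gamma)
    if inv.denominator != 1:
        exact = ratio(p, floor(inv))
        return exact, exact
    m = int(inv)
    return ratio(p, m), ratio(p, m - 1)


print(limiting_ber_bounds(Fraction(1, 10), Fraction(2, 5)))  # 1/gamma = 2.5
print(limiting_ber_bounds(Fraction(1, 10), Fraction(1, 2)))  # 1/gamma = 2
```

With $p = 1/10$, the non-integer case $1/\gamma = 2.5$ gives the single value $p^2/(p^2 + (1-p)^2) = 1/82$, while the integer case $1/\gamma = 2$ gives the interval from $1/82$ to $p = 1/10$.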

To prove optimality of $T = k$, we consider neighborhood sizes $T_1$ and $T_2$ such that $T_1 < k < T_2$. We then
consider a related estimation problem, for which the maximum a posteriori (MAP) estimator has probability of
error equal to that of PAF($k$). We also show that the probabilities of error for PAF($T_1$) and PAF($T_2$) equal those
of two sub-optimal estimators of the above mentioned related estimation problem. Since the MAP estimator minimizes
the probability of error over all estimators [24, p. 8], this would prove the lemma. The detailed proof is presented in
Appendix F.

The optimality of $T = k$ shows that $\forall T$, $\limsup_{n \to \infty} P_e[\mathrm{PAF}(k)] \le \liminf_{n \to \infty} P_e[\mathrm{PAF}(T)]$. By substituting
$T = k$, we obtain

$$\limsup_{n \to \infty} P_e[\mathrm{PAF}(k)] \le \liminf_{n \to \infty} P_e[\mathrm{PAF}(k)] \le \limsup_{n \to \infty} P_e[\mathrm{PAF}(k)].$$
Thus the limit $\lim_{n \to \infty} P_e[\mathrm{PAF}(k)]$ exists. This completes the proof of Theorem 2.

VI. Proof of Theorem 3

Assume that $\alpha = \frac{1}{2} + \beta$, with $\beta > 0$. We assume that there are no errors (only erasures), i.e., $p = 0$, and show
that the algorithm fails. To start with, we show that w.h.p. every row overlaps with the first row in at most a finite
number of places. This in turn implies that amongst the top $T$ neighbors, only a vanishingly small fraction are
good neighbors. Recall the definition of $s_{ij}$ from Section II-A that measures the similarity between two rows.

Lemma 7 (Finite overlap): There exists a constant $t_{max} > 0$ (which depends on $\beta$) such that w.h.p. $\max_{i \ne 1} s_{1i} \le t_{max}$.

Proof: The proof is given in Appendix G. ■
Using Lemma 7 we first show that most neighbors of row 1 are bad.

A. Most Neighbors are Bad

Suppose for a non-negative integer $m$, $N_{good}(m)$ denotes the number of neighbors (excluding row 1 itself) from
$A_1$ that have $m$ commonly sampled entries with row 1, i.e.,

$$N_{good}(m) := |\{i \in A_1 : s_{1i} = m\}|. \tag{7}$$

More generally, for a row cluster $A_i$, we define

$$N_i(m) := |\{j \in A_i : s_{1j} = m\}|, \tag{8}$$

to be the number of neighbors in $A_i$ with $m$ commonly sampled entries. We see that $N_1(m) = N_{good}(m)$. The
total number of neighbors outside $A_1$ is denoted by

$$N_{bad}(m) := N_2(m) + \ldots + N_r(m). \tag{9}$$

Let

$$N(m) := N_{good}(m) + N_{bad}(m) \tag{10}$$
denote the total number of neighbors. We show that for all $m \le t_{max}$, $N_{good}(m)$ forms a vanishingly small fraction
of $N(m)$. In the following lemma, we show that for "large" values of $k$, w.h.p. all the row clusters contribute equally
to the top $T$ neighbors (up to a constant factor); for "moderate" values of $k$, w.h.p. the contribution of the first
row cluster is vanishingly small compared to the total contribution of the other row clusters; and for "small" values
of $k$, w.h.p. the first row cluster does not contribute to the top $T$ neighbors. In all three cases, amongst the
top $T$ neighbors, w.h.p. we have a vanishingly small number of good neighbors compared to the bad neighbors.

Lemma 8 (Most neighbors are bad): There exists a constant $c_4 > 0$ such that for $m = 1, 2, \ldots, t_{max}$,

1) If $k \ge c_4 n^{m(2\alpha-1)} \log r$, then w.h.p.

$$\{N_j(m)\}_{j=1}^{r} = \Theta(E[N_1(m)]) = \Theta\!\left(\frac{k}{n^{m(2\alpha-1)}}\right).$$

2) If there exists a constant $c_5 > 0$ such that $c_5 n^{m(2\alpha-1)} \le k < c_4 n^{m(2\alpha-1)} \log r$, then w.h.p.

$$\{N_j(m)\}_{j=1}^{r} = O(\log r).$$

Moreover, there exists a subset $S$ of $[r]$ such that $|S| = \Omega(r)$, and for all $j \in S$ we have $N_j(m) \ge 1$.

3) If $k = o(n^{m(2\alpha-1)})$, then w.h.p. $N_{good}(m) = 0$.

Proof: The proof is given in Appendix H. ■
Since $N_{bad}(m) = \sum_{i=2}^{r} N_i(m)$, and $r$ goes to infinity with $n$, Lemma 8 implies that good neighbors form a
vanishingly small fraction of the total number of neighbors. Let $N_j(m^+)$ denote the number of neighbors from the
cluster $A_j$ with an overlap at $m$ or more entries. In other words,

$$N_j(m^+) := \sum_{t=m}^{t_{max}} N_j(t). \tag{11}$$

Also let $N(m^+)$ denote the total number of neighbors with an overlap of more than or equal to $m$ entries, i.e.,

$$N(m^+) := \sum_{j=1}^{r} N_j(m^+). \tag{12}$$

Lemma 8 implies the following corollary.

Corollary 1: There exists a constant $c_4 > 0$ such that for $m = 1, 2, \ldots, t_{max}$,

1) If $k \ge c_4 n^{m(2\alpha-1)} \log r$, then w.h.p.

$$\{N_j(m^+)\}_{j=1}^{r} = \Theta\!\left(\frac{k}{n^{m(2\alpha-1)}}\right).$$

2) If there exists a constant $c_5 > 0$ such that $c_5 n^{m(2\alpha-1)} \le k < c_4 n^{m(2\alpha-1)} \log r$, then w.h.p.

$$\{N_j(m^+)\}_{j=1}^{r} = O(\log r).$$

Moreover, there exists a subset $S$ of $[r]$ such that $|S| = \Omega(r)$, and for all $j \in S$ we have $N_j(m^+) \ge 1$.

3) If $k = o(n^{m(2\alpha-1)})$, then w.h.p. $N_{good}(m^+) = 0$.

Proof: The proof is given in Appendix I. ■
B. Even the Top Few Neighbors are Mostly Bad

Now we analyze what happens when we pick the top $T$ rows (neighbors). We show that even amongst the top
$T$ neighbors, only a vanishingly small fraction are good neighbors.

Recall that $Y_T$ denotes the $T \times n$ sub-matrix of $Y$ obtained by picking the top $T$ neighboring rows. Let $T_i$ denote
the number of rows picked from the cluster $i$ (excluding the first row itself). Thus $T = \sum_{i=1}^{r} T_i + 1$. Suppose $m_0$
is a positive integer such that

$$N((m_0 + 1)^+) < T < N(m_0^+). \tag{13}$$

Then amongst the top $T$ neighbors, we have all the rows that overlap at $m_0 + 1$ positions or more, and some of
the rows that overlap at $m_0$ entries. To be precise,

$$T_i = N_i((m_0 + 1)^+) + \delta_i, \tag{14}$$

where $\delta_i$ is a hyper-geometric random variable with parameters $(N(m_0), N_i(m_0), T - 1 - N((m_0 + 1)^+))$,^2
implying

$$E[\delta_i] = \frac{N_i(m_0)}{N(m_0)} \big(T - 1 - N((m_0 + 1)^+)\big). \tag{15}$$

Summing both sides of (14) over $i$, we observe that

$$\sum_{i=1}^{r} \delta_i = T - 1 - N((m_0 + 1)^+). \tag{16}$$

From (14) and (15) we obtain

$$E[T_i] = N_i((m_0 + 1)^+) + \frac{N_i(m_0)}{N(m_0)} \big(T - 1 - N((m_0 + 1)^+)\big). \tag{17}$$

Lemma 8 and Corollary 1 now imply that $E[T_1]$ forms a vanishingly small fraction of $T$. Using Chvátal's
hyper-geometric concentration lemma (see Lemma 16 in Appendix N), we show in the following lemma that this
is not just true for the expectation, but w.h.p. also for $T_1$.
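
The expectation used for the number of rows drawn from one cluster is the standard hyper-geometric mean (draws times successes over population). The sketch below checks that mean against the exact probability mass function and then evaluates an expression of the same shape as the expected number of rows picked from a single cluster. All function and argument names are hypothetical, and the numbers are toy values chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(population, successes, draws, x):
    """P[x successes] when `draws` items are taken without replacement."""
    return Fraction(
        comb(successes, x) * comb(population - successes, draws - x),
        comb(population, draws),
    )


def hypergeom_mean(population, successes, draws):
    """Closed-form mean: draws * successes / population."""
    return Fraction(draws * successes, population)


def expected_rows_from_cluster(n_i_above, n_i_at, n_at, t, n_above):
    """Rows of the cluster that are always picked, plus its expected share
    of the remaining t - 1 - n_above slots filled at the cutoff overlap."""
    return n_i_above + hypergeom_mean(n_at, n_i_at, t - 1 - n_above)


mean_from_pmf = sum(x * hypergeom_pmf(10, 3, 4, x) for x in range(0, 4))
print(mean_from_pmf, hypergeom_mean(10, 3, 4))
print(expected_rows_from_cluster(n_i_above=2, n_i_at=3, n_at=10, t=11, n_above=6))
```

Because the share of a cluster among the rows at the cutoff overlap is proportional to its count at that overlap, a cluster that holds a vanishing fraction of those rows also receives a vanishing fraction of the remaining slots in expectation.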

Lemma 9 (Top neighbors are bad too): There is a positive integer $d$ and positive constants $c_6, c_7$, such that
depending on the value of $T$, w.h.p. one of the following occurs.
(C1) $T_1 \ge c_6 \log r$, and for $i = 2, 3, \ldots, r$ we have $d\,T_i \ge T_1$.

(C2) $0 < T_1 < c_6 \log r$, and there is a subset $S$ of $[r] \setminus \{1\}$ with $|S| \ge c_7 r$ such that $\forall i \in S$ we have $T_i \ge 1$.

(C3) $T_1 = 0$.

Proof: The proof is given in Appendix J. ■
This implies that amongst the top $T$ neighbors, only a vanishingly small fraction are good neighbors. Step 2 of
the PAF algorithm now performs a majority decoding on $Y_T$, i.e., it recommends a column

$$j_{max} = \arg\max_{j : Y_T(1,j) = *} |Y_T(:,j)|_1,$$

leading to a probability of error $P_e^{maj}[Y_T] := \Pr[X(1, j_{max}) = 0]$. Thus we have

$$P_e[\mathrm{PAF}(T)] = P_e^{maj}[Y_T]. \tag{18}$$
In the following section, we show that the probability of error for the majority decoding approaches 1/2 w.h.p.

^2 After picking all the neighbors with an overlap of $m_0 + 1$ or more places, we need to pick $T - 1 - N((m_0 + 1)^+)$ more neighbors with
an overlap of $m_0$ positions. But there are $N(m_0)$ neighbors with an overlap of $m_0$ positions, out of which $N_i(m_0)$ are from the cluster $i$. See
Appendix N for the definition of a hyper-geometric random variable and some useful tail bounds.
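
The majority decoding step can be sketched as follows. The sketch assumes that the first row of the sub-matrix is the user's own row, that erased entries are stored as the string "*", and that ties are broken in favor of the lowest column index; the function name `majority_decode` and the tie rule are assumptions of this illustration, not part of the paper.

```python
ERASED = "*"  # marker for an unobserved rating (an assumption of this sketch)


def majority_decode(y_top):
    """Among the columns hidden for row 0, return the one with the most 1's."""
    hidden = [j for j, value in enumerate(y_top[0]) if value == ERASED]
    if not hidden:
        return None  # nothing left to recommend

    def ones(j):
        return sum(1 for row in y_top if row[j] == 1)

    # Largest count of 1's first; among equal counts prefer the smaller index.
    return max(hidden, key=lambda j: (ones(j), -j))


y_top = [
    ["*", 1, "*", "*"],   # the user's own row: columns 0, 2 and 3 are hidden
    [1, 0, 1, "*"],
    [1, 1, 1, 1],
    ["*", 1, 1, 1],
]
print(majority_decode(y_top))
```

Column 1 has the most 1's overall but is already observed by the user, so it is not a candidate; among the hidden columns, column 2 collects three 1's and is recommended.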

C. Analysis of Step 2 of the Algorithm 

In this section, we show that since the top T rows include many bad rows, choosing the most popular item 
amongst the top T rows does not perform well. To this end, since direct calculations are not analytically tractable, 
we take a somewhat circuitous route. We first show that when we increase the number of good neighbors and 
decrease the number of bad neighbors in a certain way, and some of the missing entries are revealed, then the 
probability of error reduces. We then lower bound the probability of error for this modified case, which is easier to 
analyze. We first introduce a new notation to represent the class of binary matrices with non-uniform cluster sizes. 
Suppose a and b are two vectors of length r. 

Definition 1 (Random binary matrix): Let $X$ be a binary block constant matrix, whose $i$th row cluster $A_i$ is of
size $a(i)$ and the $j$th column cluster $B_j$ is of size $b(j)$. Suppose the entries of the matrix are filled as below.
If $(p, q) \in A_i \times B_j$, then $X(p, q) = x_{ij}$, where $\{x_{ij}\}_{i,j=1}^{r}$ are i.i.d. Bernoulli(1/2). This class of random binary
matrices is denoted as $X \in \mathcal{N}_B(a, b)$.
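
A block constant matrix with non-uniform cluster sizes can be sampled as in the following sketch. The function name `random_block_matrix`, the use of Python's `random.Random` as the source of randomness, and the example cluster sizes are assumptions made for illustration. Clusters of size zero are allowed, since the modified matrix considered later in this section sets some row-cluster sizes to zero.

```python
import random


def random_block_matrix(a, b, rng):
    """Sample a block constant binary matrix.

    a[i] is the size of the i-th row cluster, b[j] the size of the j-th
    column cluster; each block takes an independent Bernoulli(1/2) value.
    Returns the expanded matrix and the table of block values.
    """
    block_values = [[rng.randint(0, 1) for _ in b] for _ in a]
    matrix = []
    for i, rows in enumerate(a):
        expanded = [
            block_values[i][j] for j, cols in enumerate(b) for _ in range(cols)
        ]
        matrix.extend(list(expanded) for _ in range(rows))
    return matrix, block_values


matrix, blocks = random_block_matrix([2, 0, 3], [1, 2, 1], random.Random(0))
for row in matrix:
    print(row)
```

Rows in the same row cluster are identical, and columns in the same column cluster agree within every row, whatever values the random generator produces.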

First we condition on the event that $T_1 = 0$ (i.e., condition (C3) of Lemma 9 is true). In this case, we see
that the outcome of the majority decoding is independent of $A_1$, and hence we have

$$P_e^{maj}[Y_T \mid C_3] = 1/2. \tag{19}$$

We now consider the cases when either of the conditions (C1) or (C2) of Lemma 9 is true. For this we consider
a different matrix which has more good neighbors and fewer bad neighbors compared to $Y_T$. Let $u_n$ be the smallest
multiple of $d$ greater than or equal to $T_1$, i.e.,

$$u_n := d \left\lceil \frac{T_1}{d} \right\rceil,$$

and suppose there is a subset $S$ of $[r] \setminus \{1\}$ such that w.h.p. for $j \in S$, $T_j \ge l_n$ (we have such lower bounds on
$T_j$, due to Lemma 9). Let $a_e$ (subscript "e" is for extreme values of the row cluster sizes) be the vector such that
$a_e(1) = u_n + 1$, $a_e(j) = l_n$ for $j \in S$, and $a_e(j) = 0$ otherwise. Also let $b_u$ be the $r$-length vector with all the
entries equal to $k$.

Suppose $X^{(e)} \in \mathcal{N}_B(a_e, b_u)$, and only the first row of this matrix is passed through a memoryless erasure
channel with erasure probability $\epsilon$ to obtain the matrix $Y^{(e)}$. We note that there are no erasures in the rows other
than the first one. We now perform a majority decoding for $Y^{(e)}$, and let $j_{maj}(Y^{(e)})$ and $P_e^{maj}[Y^{(e)}]$ be the
column selected by the majority decoder, and the corresponding probability of error respectively. We then have the
following lemma.

Lemma 10: For $Y^{(e)}$ as defined above, we have

$$P_e^{maj}[Y_T] \ge P_e^{maj}[Y^{(e)}].$$
Proof: The proof is given in Appendix K. ■
We now analyze the majority decoding on the matrix $Y^{(e)}$ when one of the conditions (C1) or (C2) of Lemma
9 is true. We state this in the following lemma.

Lemma 11: $P_e^{maj}[Y^{(e)} \mid C_1] = 1/2 - o(1)$, and $P_e^{maj}[Y^{(e)} \mid C_2] = 1/2 - o(1)$.

Proof: The proof is given in Appendix L. ■
Due to (18) and Lemma 10 we see that

$$\begin{aligned}
P_e[\mathrm{PAF}(T)] &\ge P_e^{maj}[Y^{(e)}] \\
&\overset{(a)}{=} \sum_{i=1}^{3} P_e^{maj}[Y^{(e)} \mid C_i] \Pr[C_i] + o(1) \\
&\overset{(b)}{=} 1/2 + o(1), \qquad (20)
\end{aligned}$$
where (a) is due to Lemma 9, which says that $\cup_{i \in \{1,2,3\}} C_i$ occurs w.h.p., and (b) follows from (19) and Lemma
11. This completes the proof of Theorem 3.



VII. Proof of Theorem 4

Proof of the converse: To prove this lower bound, we first assume that an oracle tells us the true row and column
clusters (i.e., $A$ and $B$). Let $P_{e,oracle}(n)$ denote the error probability of the MAP estimator, when we know the
clusters. Thus $P_{e,oracle}(n)$ is a lower bound on the error probability of any recommender. As before, we assume
wlog that we want to recommend an item to user 1 in $A_1$.

Since entries across clusters are i.i.d., the MAP estimator would choose an item from the column cluster $B_j$ for
which we have the maximum number of 1's in the cluster $A_1 \times B_j$ of $Y$. We note that while PAF picks a maximum
weight column, this algorithm picks a maximum weight cluster and recommends a movie from that cluster. Because
of the i.i.d. nature of the data model, the analysis for this algorithm is similar to that of PAF.

Suppose $Y_{A_1 \times B_j}$ denotes the matrix $Y$ restricted to the cluster $A_1 \times B_j$. By using steps similar to those
used in proving Lemma 4, we obtain that

$$\max_{j \in \{1, 2, \ldots, r\}} |Y_{A_1 \times B_j}| \le \lfloor 1/\gamma \rfloor.$$

Then by defining

$$\mathcal{C}_n := \{y \in \{0,1,*\}^{k \times k} : |y| \le \lfloor 1/\gamma \rfloor\},$$

and using steps similar to those used for proving the lower bound in Section V-B, we see that

$$P_{e,oracle}(n) \ge \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} + o(1).$$

Since $P_{e,oracle}(n)$ is a lower bound for the error probability of any recommender, we have

$$P_e(n) \ge P_{e,oracle}(n) \ge \frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} + o(1).$$

Taking $\liminf_{n \to \infty}$ of both sides proves the converse.
Proof of achievability: We want to recommend an item to user 1 in row cluster $A_1$. We use the following
algorithm to achieve the bounds. First we cluster the rows and the columns of the matrix as below. Each row
chooses the $k$ most similar rows, and each column chooses the $k$ most similar columns (see Section II-A for the
definition of "similarity"). For $\alpha < 1/2$, below we show that all the rows (or columns) find the right set of neighbors,
and thus we can find the true clusters of the matrix. Let the row clusters be denoted by $A_i$'s, while $B_j$'s denote
the column clusters. To recommend, we choose an (unseen) item from the column cluster $B_j$ for which we have
the maximum number of 1's in the cluster $A_1 \times B_j$. Let $P_e(n)$ denote the probability of error for this algorithm.

First we show that indeed w.h.p. the above method leads to correct clustering of the matrix. Let $E_{3,n}$ denote the
event that we make an error in clustering. For a row cluster $A_i$, let $X_{A_i}$ denote the matrix $X$ restricted to the rows
in $A_i$. For row clusters $A_i$ and $A_j$, suppose $D_{ij}$ denotes the number of column clusters at which $X_{A_i}$ and $X_{A_j}$
differ. Then for $i \ne j$, $D_{ij} \sim B(r, 1/2)$, and the Chernoff bound [23, Theorem 1.1] implies that for $\delta \in (0, 1)$ we
have $\Pr[D_{ij} < \frac{r}{2}(1-\delta)] \le e^{-r\delta^2/4}$. Thus using the union bound, we obtain

$$\Pr\left[\min_{i \ne j} D_{ij} < \frac{r}{2}(1-\delta)\right] \le \binom{r}{2} e^{-r\delta^2/4},$$

which approaches 0 if $r \to \infty$. Thus w.h.p. all the $D_{ij}$'s are greater than $\frac{r}{2}(1-\delta)$. Now using arguments similar to
those used in proving Lemma 1 and Lemma 2 along with the union bound, we observe that w.h.p. all the rows find
the right set of neighbors. Similarly, we can also prove that w.h.p. all the columns find the right set of neighbors.
In other words, $\Pr[E_{3,n}^c] \to 1$ as $n \to \infty$.
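
The Chernoff step can be sanity-checked numerically. The sketch below compares the exact lower tail of a Binomial$(r, 1/2)$ variable with the bound $e^{-r\delta^2/4}$ and then evaluates a union bound over all pairs of row clusters. The factor $\binom{r}{2}$ counting the pairs, the function names, and the chosen values of $r$ and $\delta$ are assumptions of this illustration.

```python
from math import ceil, comb, exp


def binomial_lower_tail(r, threshold):
    """Pr[B(r, 1/2) < threshold], computed exactly from binomial coefficients."""
    last = min(r, ceil(threshold) - 1)  # largest integer strictly below threshold
    return sum(comb(r, i) for i in range(last + 1)) / 2**r


def chernoff_bound(r, delta):
    """Upper bound exp(-r * delta^2 / 4) on Pr[B(r, 1/2) < (r/2)(1 - delta)]."""
    return exp(-r * delta**2 / 4)


def union_bound(r, delta):
    """Bound on the probability that some pair of clusters differs too little."""
    return comb(r, 2) * chernoff_bound(r, delta)


r, delta = 20, 0.5
print(binomial_lower_tail(r, r / 2 * (1 - delta)), chernoff_bound(r, delta))
print(union_bound(1000, delta), union_bound(10000, delta))
```

For small $r$ the Chernoff bound is loose but valid; for large $r$ the exponential decay dominates the quadratic number of pairs, so the union bound vanishes.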

For the rest of the proof, we condition on $E_{3,n}^c$. Since $E_{3,n}^c$ happens w.h.p., wlog we can assume that the statistics
of the individual clusters do not change asymptotically conditioned on $E_{3,n}^c$ (i.e., they are still i.i.d. as in the original
data model). This is because if $\Pr[A_n] \to 1$ and $\Pr[B_n] \to b$, then $\Pr[B_n \mid A_n] \to b$ as well.

Once we know the clusters, we recommend an item from the column cluster $B_j$ for which we have the maximum
number of 1's in the cluster $A_1 \times B_j$. Suppose we denote this item by $j_0$. As before, suppose $Y_{A_1 \times B_j}$ denotes the
matrix $Y$ restricted to the cluster $A_1 \times B_j$.

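For $k^2 \ge n^{\alpha - \gamma_n}$, using steps similar to those used in proving Lemma 3, we see that there exists a sequence of
positive reals $c_n$, such that $c_n \to \infty$, and w.h.p., for the chosen column cluster $B_j$,

$$|Y_{A_1 \times B_j}|_1 - |Y_{A_1 \times B_j}|_0 > c_n.$$

In other words, the chosen cluster has an overwhelming number of 1's compared to 0's. Then using steps very
similar to those in Section IV-B, we see that

$$P_e(n) \to 0.$$

For $k^2 = n^{\alpha - \gamma}$, using steps similar to those used in proving Lemma 4, we obtain that

$$\max_{j \in \{1, 2, \ldots, r\}} |Y_{A_1 \times B_j}| \le \lfloor 1/\gamma \rfloor.$$

Then by defining

$$\mathcal{C}_n := \{y \in \{0,1,*\}^{k \times k} : |y| \le \lfloor 1/\gamma \rfloor\},$$
and using steps similar to those used for proving the bounds in Section V-A and Section V-B, we see that

$$\frac{p^{\lfloor 1/\gamma \rfloor}}{p^{\lfloor 1/\gamma \rfloor} + (1-p)^{\lfloor 1/\gamma \rfloor}} \le \lim_{n \to \infty} P_e(n) \le \frac{p^{\lceil 1/\gamma \rceil - 1}}{p^{\lceil 1/\gamma \rceil - 1} + (1-p)^{\lceil 1/\gamma \rceil - 1}}.$$

Note that the above lower bound and the upper bound match, unless $1/\gamma$ is an integer. This proves the achievability.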

VIII. Conclusion 

We have considered a neighborhood based method (the PAF algorithm) for recommending items to users when 
some ratings are available. On MovieLens data and a snapshot of Netflix data, the BER of the PAF algorithm is 
similar to that of OptSpace [3], a method based on low-rank matrix completion. To explain this performance, we
analyzed the PAF algorithm for a binary random matrix model introduced in [1]. We consider the probability that a
given recommendation is incorrect, and we identify the regimes where the PAF algorithm works well, as well as the 
regimes where it does not. In particular, the regime of a < 1/2 and k = where 7 > and gn — > seems 

to be the most suitable to describe the observed empirical results. Several extensions of this work are feasible, that 
can perhaps provide further insight into the performance on real data. 

Throughout this paper, we consider the case when PAF recommends only one item to each user. A natural 
generalization is to recommend multiple items (say, q items), instead of just one. Then we are interested in the 
probability that t (t < q) of these recommended items are correct. However, because of the dependencies among
the recommended items, this is not a straightforward generalization of the analysis of this paper, and it remains an open
direction for future work. One other important direction is to consider an alternative sampling mechanism that
has "power law" characteristics similar to that seen in real data. Another direction is to generalize the class of 
underlying matrices. 

Appendix 

A. Proof of Lemma 2

For a row $i \notin A_1$, suppose $D_i$ denotes the number of column clusters of $X$ that have different values in the 1st
and the $i$-th row. Then there are $r - D_i$ column clusters where the 1st and the $i$-th row of $X$ match. Then $D_i k$
denotes the number of columns of $X$ that have different values in the 1st and the $i$-th row. First we observe that
there exists a constant $c_1 \in (0, 1)$, such that

$$\text{w.h.p. for all } i, \quad D_i k \ge c_1 n. \tag{21}$$

This is true when $r$ is bounded, because the $i$th row and the 1st row of $X$ differ in at least one column cluster, implying
$D_i \ge 1$, and hence $D_i k \ge k = n/r$. Using $c_1 \le 1/r$ proves (21). When $r \to \infty$ with $n$, we have $D_i \sim B(r, 1/2)$.
Thus, the Chernoff bound [23, Theorem 1.1] implies that for any $\delta \in (0, 1)$, w.h.p. $\min_{i \notin A_1} D_i \ge \frac{r}{2}(1-\delta)$. Thus,
$\min_{i \notin A_1} D_i k \ge \frac{n}{2}(1-\delta)$. Using $c_1 = (1-\delta)/2$ now completes the proof of (21).

Suppose we condition on the event that for all $i$, $D_i k \ge c_1 n$. We call this the event $S_{1,n}$. If two given entries of
$X$ match, then the corresponding entries of $Y$ are not erased and match with probability $(1-\epsilon)^2((1-p)^2 + p^2)$.
Similarly, if two entries of $X$ differ, then the corresponding entries of $Y$ are not erased and match with probability
$2(1-\epsilon)^2 p(1-p)$. Thus we have $s_{1i} = B\big((r - D_i)k, (1-\epsilon)^2((1-p)^2 + p^2)\big) + B\big(D_i k, 2(1-\epsilon)^2 p(1-p)\big)$. In
other words, $s_{1i}$ is a sum of $n$ independent Bernoulli trials with

$$\begin{aligned}
E[s_{1i}] &= (r - D_i)k(1-\epsilon)^2((1-p)^2 + p^2) + 2 D_i k (1-\epsilon)^2 p(1-p) \\
&= n^{1-2\alpha}((1-p)^2 + p^2) - D_i k (1-\epsilon)^2\big((1-p)^2 + p^2 - 2p(1-p)\big) \\
&= n^{1-2\alpha}((1-p)^2 + p^2) - D_i k\, n^{-2\alpha}(1-2p)^2 \\
&\overset{(a)}{\le} n^{1-2\alpha}((1-p)^2 + p^2) - c_1 n^{1-2\alpha}(1-2p)^2,
\end{aligned}$$

where (a) is true because $D_i k \ge c_1 n$. Thus, due to the Chernoff bound [23, Theorem 1.1], conditioned on $S_{1,n}$, we
have that w.h.p. $s_{1i} < \big[n^{1-2\alpha}((1-p)^2 + p^2) - c_1 n^{1-2\alpha}(1-2p)^2\big](1+\delta)$. The lemma is now proven by observing
from (21) that $S_{1,n}$ happens w.h.p.
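
The algebra behind the expected overlap can be verified with exact arithmetic. The sketch below evaluates the expectation once directly, as a sum over matching and differing columns, and once in the simplified form with the $(1-2p)^2$ penalty term, keeping the factor $(1-\epsilon)^2$ explicit throughout. The function names and the numerical values are assumptions chosen for illustration.

```python
from fractions import Fraction


def expected_overlap_direct(r, k, d, eps, p):
    """Sum of the contributions of matching and differing column clusters."""
    match = (1 - eps)**2 * ((1 - p)**2 + p**2)
    differ = 2 * (1 - eps)**2 * p * (1 - p)
    return (r - d) * k * match + d * k * differ


def expected_overlap_simplified(r, k, d, eps, p):
    """Same quantity written as a baseline minus a penalty for differing clusters."""
    n = r * k
    baseline = n * (1 - eps)**2 * ((1 - p)**2 + p**2)
    penalty = d * k * (1 - eps)**2 * (1 - 2 * p)**2
    return baseline - penalty


args = dict(r=8, k=5, eps=Fraction(9, 10), p=Fraction(1, 5))
for d in range(0, 9):
    assert expected_overlap_direct(d=d, **args) == expected_overlap_simplified(d=d, **args)
print(expected_overlap_simplified(d=0, **args), expected_overlap_simplified(d=3, **args))
```

The penalty grows linearly in the number of differing column clusters and vanishes only when $p = 1/2$, which is why rows outside the cluster of user 1 have a strictly smaller expected overlap.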

B. Proof of Lemma 3

We prove the lemma by first obtaining the following two lemmas, which give a lower bound for $|Y_k(:,j_{max})|_1$ and
an upper bound for $|Y_k(:,j_{max})|_0$ respectively, and which we prove towards the end of this section.

Lemma 12 (Many 1's in the most popular column): For different values of $k$, we have the following lower bounds
on $|Y_k(:,j_{max})|_1$.

1) If $k = n^{\alpha - \gamma_n}$ such that $\gamma_n > 0$ and $\gamma_n \to 0$, then w.h.p.

$$|Y_k(:,j_{max})|_1 \ge \min\left\{\sqrt{\log n}, \frac{1}{2\gamma_n}\right\}.$$

2) If $k = n^{\alpha} g_n$ for $g_n \ge 1$, then w.h.p.

|Yfc(:, j,„aa;)|i > max + min ^a-y^, ^/lognj ay, \/lognj . 
Lemma 13 (Few 0's in the most popular column): For different values of $k$, we have the following upper bounds
on $|Y_k(:,j_{max})|_0$.

1) If $k = n^{\alpha - \gamma_n}$ such that $\gamma_n > 0$ and $\gamma_n \to 0$, then w.h.p.

$$|Y_k(:,j_{max})|_0 \le \min\left\{\frac{\sqrt{\log n}}{2}, \frac{1}{4\gamma_n}\right\}.$$

2) If $k = n^{\alpha} g_n$ for $g_n \ge 1$, then w.h.p.

\'yki:,jmax)\o < maxj^y + i min |(Ty v^lognj cry, ^^^^ " | . 

These two lemmas together imply that there exists a sequence of positive reals $\{c_n\}$ such that $c_n \to \infty$ with $n$,
and w.h.p.

$$|Y_k(:,j_{max})|_1 - |Y_k(:,j_{max})|_0 > c_n.$$

This proves Lemma 3. Below we prove Lemma 12 and Lemma 13.
Proof of Lemma 12: Conditioned on the event that the top $k$ neighbors picked by PAF are all good, $|Y_k(:,j)|_1$ is
binomially distributed for $j \in S$. We prove the lemma by carefully lower bounding the upper tail of this binomial
using a theorem on moderate deviations.

1) Recall that we have conditioned on the event that all the rows in the top $k$ neighbors chosen by PAF are good.
Suppose $k = n^{\alpha - \gamma_n}$. Recall that $S$ denotes the set of column indices such that $X(1,j) = 1$ and $Y(1,j) = *$.
Claim 1: There exists a constant $c_2 > 0$, such that w.h.p. $|S| \ge c_2 n$.

Proof of Claim 1: We see that $|S| \sim B(M, \epsilon)$, where $M \sim k \cdot B(r, \frac{1}{2})$. Here $M$ denotes the number
of columns of $X$ with 1's as the true ratings of user 1. For case a), where $r$ increases to $\infty$ with $n$ (since
$k = o(n)$), due to the Chernoff bound [23, Theorem 1.1] we have w.h.p. $|S| \ge n/3$. For case b), where $k = \Omega(n)$,
$r$ stays bounded (suppose $r \le r_0$ always) and since the first row of $X$ is not all zero, we have $M \ge k \ge n/r_0$.
Thus due to the Chernoff bound, we have w.h.p. $|S| \ge n/(2r_0)$. This proves the claim. ■
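For a column $j \in S$ we see that $|Y_k(:,j)|_1 \sim B(k, (1-\epsilon)(1-p))$, and they are independent for different
values of $j$. Thus, for $j \in S$,

$$\begin{aligned}
\Pr[|Y_k(:,j)|_1 \ge t] &\ge \Pr[|Y_k(:,j)|_1 = t] \\
&\overset{(a)}{\ge} \binom{k}{t} \big((1-\epsilon)(1-p)\big)^t \epsilon^{k-t} \\
&\overset{(b)}{\ge} \left(\frac{c(1-p)}{t\, n^{\gamma_n}}\right)^{t} e^{-2\ln(2) c / n^{\gamma_n}}, \quad \text{for large } n \\
&\overset{(c)}{\ge} \left(\frac{c(1-p)}{t\, n^{\gamma_n}}\right)^{t} e^{-2\ln(2) c}, \qquad (22)
\end{aligned}$$

where (a) is true since $1 - (1-\epsilon)(1-p) \ge \epsilon$, (b) follows since $\epsilon = 1 - c/n^{\alpha}$, $1 - x \ge e^{-2\ln(2)x}$ for
$x \in [0, 1/2]$, and $\binom{k}{t} \ge (\frac{k}{t})^t$ (see [25, p. 434]), and (c) is true because $\gamma_n > 0$. Since w.h.p. $|S| \ge c_2 n$, we
now have

$$\begin{aligned}
\Pr\Big[\max_{j \in S} |Y_k(:,j)|_1 < t\Big] &\le \Pr\Big[\max_{j \in S} |Y_k(:,j)|_1 < t \,\Big|\, |S| \ge c_2 n\Big] + o(1) \\
&\le \left(1 - \left(\frac{c(1-p)}{t\, n^{\gamma_n}}\right)^{t} e^{-2\ln(2) c}\right)^{c_2 n} + o(1) \\
&\le \exp\left(-c_2 n \left(\frac{c(1-p)}{t\, n^{\gamma_n}}\right)^{t} e^{-2\ln(2) c}\right) + o(1). \qquad (23)
\end{aligned}$$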


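Suppose we put $t = t_0 := \min\{\sqrt{\log n}, \frac{1}{2\gamma_n}\}$. Then

$$n \left(\frac{c(1-p)}{t_0\, n^{\gamma_n}}\right)^{t_0} = n^{1 - \gamma_n t_0} \left(\frac{c(1-p)}{t_0}\right)^{t_0} \overset{(a)}{\ge} \sqrt{n} \left(\frac{c(1-p)}{\sqrt{\log n}}\right)^{\sqrt{\log n}} = \omega(1),$$

where (a) follows since $\gamma_n t_0 \le 1/2$ and $t_0 \le \sqrt{\log n}$. Thus, from (23) we obtain

$$\Pr[|Y_k(:,j_{max})|_1 < t_0] \le e^{-\omega(1)} + o(1) = o(1).$$

This proves the first part of the lemma.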
2) Recall that we have assumed $k = n^{\alpha} g_n$ for $g_n \ge 1$. By following a very similar analysis as in the first part,
we see that w.h.p. $|Y_k(:,j_{max})|_1 \ge \sqrt{\log n}$. In particular, for $g_n = 1$ (or equivalently for $k = n^{\alpha}$), (22)
becomes

$$\Pr[|Y_k(:,j)|_1 \ge t] \ge \left(\frac{c(1-p)}{t}\right)^{t} e^{-2\ln(2) c}. \tag{24}$$

Observe that for two random variables $X$ and $Y$ such that $X \sim B(n_1, p)$ and $Y \sim B(n_2, p)$ with $n_1 \ge n_2$,
we have $\Pr[X \ge t] \ge \Pr[Y \ge t]$. Thus, using (24) we have

$$\Pr\big[|Y_k(:,j)|_1 \ge t \mid g_n \ge 1\big] \ge \Pr\big[|Y_k(:,j)|_1 \ge t \mid g_n = 1\big] \ge \left(\frac{c(1-p)}{t}\right)^{t} e^{-2\ln(2) c}.$$

Hence for $t = \sqrt{\log n}$, (23) has the following counterpart,

$$\Pr[|Y_k(:,j_{max})|_1 < t] \le \exp\left(-c_2 n \left(\frac{c(1-p)}{t}\right)^{t} e^{-2\ln(2) c}\right) + o(1) = o(1).$$

But in Lemma |3] we need better bounds for g„ — > oo, and we consider this case now. Recall that for j e S, 

fiY = S[|Yfc(:,j)|i] = c(l -p).g„ and = Var{\Yk{:, = c{l-p)g„ {l - (1 - e){l-p)). We define 
<„ min{(Ty^^, ^/Yogn}. Then tf^ = o((Ty) and Theorem [s] implies that for a column j e 5, 

1 

Pr[|Yfc(:,j)|i >/iy + i«fTY] =Q(t„) = -j=^e''^/^ 

V 27rt„ 

1 1 _f2 /Q 

>- — e , for large n 

2 v27rt„ 



\Jn log 



where (a) is true because Q[t) = ^/f^^ 126 !, Lemma L2], and (b) is true since i„ < -^/log n. Since 
w.h.p. |5| > C2n, we have 



)|i < Mr + tnCy] <Pr 



max|Yfc(:,j)|i < /iy + i„cry |5| > 



o(l) 



C2n 



<e^f^(75rfe:) + o(l) = o(l). 

Thus, w.h.p. |Yfc(:, j„jo:c)|i > +<nO-y, if ffn ^ oo. We have already observed that w.h.p. \Yk{:, jmax)\i > 
y/\ogn. Thus the lemma is implied. 



Proof of Lemma 13: First we condition on the event that $X(1, j_{max}) = 1$. We observe that

$$|Y_k(:,j)|_0 \to |Y_k(:,j)|_1 \to \{j_{max} = j\}.$$
Then conditioned on the value of $|Y_k(:,j_{max})|_1 = t$, the distribution of $|Y_k(:,j_{max})|_0$ does not depend on the fact
that $j_{max}$ is the most popular column chosen by the algorithm, and hence $|Y_k(:,j_{max})|_0 \sim B(k - t, p_0)$, where
$p_0 := \frac{p(1-\epsilon)}{p(1-\epsilon) + \epsilon}$. This is because for a given column $j$ of $Y_k$, upon observing that there are exactly $t$ 1's, the other
$k - t$ entries are i.i.d. with probability of being 0 equal to $p_0$.
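
The conditional probability of a 0 among the entries that are not 1 can be cross-checked against the joint distribution of a single entry. The sketch below assumes that, when the true rating is 1, an entry is observed as 1 with probability $(1-\epsilon)(1-p)$, as 0 with probability $(1-\epsilon)p$, and is erased with probability $\epsilon$. The function names and the numerical values are illustrative assumptions.

```python
from fractions import Fraction


def zero_probability_closed_form(p, eps):
    """p(1 - eps) / (p(1 - eps) + eps)."""
    return p * (1 - eps) / (p * (1 - eps) + eps)


def zero_probability_from_joint(p, eps):
    """Pr[entry is 0 | entry is not 1] when the true rating is 1."""
    prob_one = (1 - eps) * (1 - p)    # observed and not flipped
    prob_zero = (1 - eps) * p         # observed and flipped
    return prob_zero / (1 - prob_one)


p, eps = Fraction(1, 5), Fraction(9, 10)
print(zero_probability_closed_form(p, eps), zero_probability_from_joint(p, eps))
```

Both expressions agree because $1 - (1-\epsilon)(1-p) = \epsilon + (1-\epsilon)p$. When erasures dominate, the resulting probability is small, so 0's are rare among the entries of the chosen column.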

1) Suppose $k = n^{\alpha - \gamma_n}$ such that $\gamma_n \to 0$. We define $b(k, p, i) := \binom{k}{i} p^i (1-p)^{k-i}$ to be the $i$th binomial term,
and observe that $b(k, p, i) \le (kpe/i)^i$, since $\binom{k}{i} \le (ke/i)^i$ (see [25, p. 434]). We see that

%/logn 



Pr 



|Yfe(:,j„ 



> 



^ b{k-t,po,i) 



21ogn k — t 

^ b{k-t,po,i)+ ^ b(k-t,po,i) 

■ yiog n i=21ogn+l 



i'^) ( 

< 2 logn ■ b I k ~t,pQ, 
[k ~ t)poe 



(b) 

< 2 log n 



(c) 

< 2 log n 



0(1). 



2c' 



2 



+ k ■ b{k - t, Pa, 2logn + I) 



+(fc - t) 



riT^-^/logn 



(fc - t)poe 
2 log n + 1 

c' 

7z^"(21ogn + 1) 



21ogra+l 



21ogn+l 



where (a) is true since b{k,p, i) is a decreasing function of i for i more than fcp and we have {k — t)pQ ~ o(l), 
(b) is due to the fact that b{k,p, i) < (kpe/iy , and (c) follows by observing that kpoe < (^:^^ for a constant 
c' > 0. Thus w.h.p. we have \Yk{:, jmax)\a < ^^^5^. 



Now suppose 7„ > 



2Vlogn 

Pr 



Then we see that 



|Yfe(:,j 



max ) lO 



> 



47n 



k-t 



^ b{k-t,po,i) 



(a) 



< ^ ((fc-t)poeA)^ 



e 



(4c'7„)^^/2' 



= 0(1), 
where (a) is true since $b(k,p,i) \le (kpe/i)^i$, (b) follows because $k p_0 e \le c'/n^{\gamma_n}$ for a constant $c'$, (c) is true
by observing that for $x = o(1)$, we have $\sum_{i \ge m} x^i = \Theta(x^m)$, and (d) follows since <

In > 



whenever 



2v^log n ' 



Thus we have proved that w.h.p. \Yk{:, jraax)\Q < nun{^^^^, 
2) Now we consider the other case of $k = n^{\alpha} g_n$ for $g_n \ge 1$. If $g_n$ is upper bounded by a constant, then



arguments very similar to those used in the first part tell us that w.h.p. \Yk{-, jmax)\a < 
In the remaining part of the proof, we assume that $g_n \to \infty$. Recall that for a column $j$ such that $X(1, j) = 1$, we have $\mu_Y := E[|Y_k(:,j)|_1] = k(1-\epsilon)(1-p)$ and $\sigma_Y^2 := \mathrm{Var}(|Y_k(:,j)|_1) = k(1-p)(1-\epsilon)\big(1 - (1-\epsilon)(1-p)\big)$. Conditioned on the value of $|Y_k(:, j_{\max})|_1 = t$, suppose $\mu'_Y$ and $\sigma'^2_Y$ denote the conditional mean and variance of $|Y_k(:, j_{\max})|_0$. We observe that for $t > \mu_Y$ and large enough $n$,

$$\mu'_Y = (k - t) p_0 < \mu_Y, \quad \text{and} \quad \sigma'^2_Y = (k - t) p_0 (1 - p_0) < 2\sigma_Y^2.$$

Suppose t„ := minjay ^/\ogn}. Then we have tf^ = o(t7y), and since w.h.p. yi := \Yk{:, jmax)\i > fJ-Y 



(see Lemma 12 1, using Theorem [5] we obtain 

Pr 



<Pr 



\Yk{'-,3max)\o > + yo-y 



2^/2 



0(1) 



2\/2 



Thus w.h.p. \Yk{-;imax)\a < max{^/log n/2, ^y + ^c^y}- 

Remark 7: In the above proof, we had conditioned on the event that $X(1, j_{\max}) = 1$. When we condition on $X(1, j_{\max}) = 0$, we have $p_0 = \frac{(1-p)(1-\epsilon)}{(1-p)(1-\epsilon) + \epsilon}$, and a very similar set of steps proves the lemma.

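C. Proof of Lemma 4

We condition on the event $E_{1,n}^c$ that all the top $k$ rows are good. Due to (2), this event occurs w.h.p. Then we observe that for a column $j \in \mathcal{H}$, $|Y_k(:,j)| \sim B(k, 1 - \epsilon)$. Thus

$$\begin{aligned}
\Pr[|Y_k(:,j)| \ge m] &= \sum_{t=m}^{k} \binom{k}{t} (1-\epsilon)^t \epsilon^{k-t} \le \sum_{t=m}^{k} \left(\frac{kc}{n^{\alpha}}\right)^t \\
&\le \sum_{t=m}^{\infty} \left(c\, n^{-\gamma}\right)^t \quad (\text{since } k \le n^{\alpha - \gamma}) \\
&= \frac{c^m n^{-\gamma m}}{1 - c\, n^{-\gamma}} \le 2 c^m n^{-\gamma m}, \text{ for large } n.
\end{aligned}$$

Thus we have, using the union bound,

$$\Pr\Big[\max_{j \in \mathcal{H}} |Y_k(:,j)| \ge m\Big] \le n \cdot 2 c^m n^{-\gamma m} = 2 c^m n^{1 - \gamma m} \to 0,$$

if $m > 1/\gamma$. In other words, w.h.p. we have $\max_{j \in \mathcal{H}} |Y_k(:,j)| \le \lfloor 1/\gamma \rfloor$, conditioned on $E_{1,n}^c$. Since $\Pr[E_{1,n}] = o(1)$, we have w.h.p. $\max_{j \in \mathcal{H}} |Y_k(:,j)| \le \lfloor 1/\gamma \rfloor$.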
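The tail bound and the union bound in this proof can be checked numerically. The Python sketch below is an illustration only: the parameter values ($n = 10^6$, $\alpha = 0.75$, $\gamma = 0.3$, $c = 1$) are hypothetical, and the helper names are not from the paper. It compares the exact binomial tail $\Pr[B(k, q) \ge m]$ with the geometric-series bound $(kq)^m/(1 - kq)$, where $q = 1 - \epsilon = c/n^{\alpha}$.

```python
import math

def binomial_upper_tail(k: int, q: float, m: int) -> float:
    """Exact Pr[B(k, q) >= m], summed in log-space to avoid overflow."""
    total = 0.0
    for t in range(m, k + 1):
        log_pmf = (math.lgamma(k + 1) - math.lgamma(t + 1) - math.lgamma(k - t + 1)
                   + t * math.log(q) + (k - t) * math.log1p(-q))
        total += math.exp(log_pmf)
    return total

def geometric_bound(k: int, q: float, m: int) -> float:
    """Bound sum_{t >= m} (k q)^t = (k q)^m / (1 - k q), valid when 0 < k q < 1."""
    x = k * q
    if not 0.0 < x < 1.0:
        raise ValueError("the geometric bound needs 0 < k*q < 1")
    return x ** m / (1.0 - x)

if __name__ == "__main__":
    n, alpha, gamma, c = 10**6, 0.75, 0.3, 1.0   # hypothetical example values
    k = int(n ** (alpha - gamma))                # number of neighbors, k <= n^(alpha - gamma)
    q = c / n ** alpha                           # probability that an entry is unerased
    m = math.floor(1 / gamma) + 1                # smallest integer exceeding 1/gamma
    print(binomial_upper_tail(k, q, m), geometric_bound(k, q, m))
    print("union bound over n columns:", n * geometric_bound(k, q, m))
```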

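D. Proof of Lemma 5

Conditioned on the event $E_{1,n}^c$, for a column $j \in \mathcal{H}$, $|Y_k(:,j)| \sim B(k, 1 - \epsilon)$. Thus

$$\Pr\big[|Y_k(:,j)| = \lfloor 1/\gamma \rfloor\big] = \binom{k}{\lfloor 1/\gamma \rfloor} (1-\epsilon)^{\lfloor 1/\gamma \rfloor} \epsilon^{k - \lfloor 1/\gamma \rfloor} \overset{(a)}{=} \Theta\left(\left(\frac{k}{n^{\alpha}}\right)^{\lfloor 1/\gamma \rfloor}\right) \overset{(b)}{=} \Theta\left(n^{(-\gamma + g_n)\lfloor 1/\gamma \rfloor}\right), \qquad (25)$$

where (a) is true since for a constant $m$, $\binom{k}{m} = \Theta(k^m)$, $1 - \epsilon = c/n^{\alpha}$, and $\epsilon^{k - \lfloor 1/\gamma \rfloor} \to 1$, and (b) follows since $k = n^{\alpha - \gamma + g_n}$. Let $A$ be the set of columns $j \in \mathcal{H}$ for which $X(1,j) = 1$ and $|Y_k(:,j)| = \lfloor 1/\gamma \rfloor$. For every column $j$, let

$$\chi_j = \begin{cases} 1, & \text{if } j \in A \\ 0, & \text{otherwise.} \end{cases}$$

Then by linearity of expectation, we have

$$E[|A|] = E\Big[\sum_{j \in \mathcal{H}} \chi_j\Big] = \sum_{j \in \mathcal{H}} \Pr[j \in A] \overset{(a)}{=} \Theta\left(n^{1 + (-\gamma + g_n)\lfloor 1/\gamma \rfloor}\right), \qquad (26)$$

where (a) is true due to (25). We see that the rightmost expression in (26) increases to infinity, since $g_n = o(1)$ and $1/\gamma$ is not an integer. Moreover, for $j \in \mathcal{H}$, the $\chi_j$'s are independent. Thus using the Chernoff bound we have w.h.p.

$$|A| = \Theta\left(n^{1 + (-\gamma + g_n)\lfloor 1/\gamma \rfloor}\right). \qquad (27)$$

For a column $j \in A$,

$$\Pr\big[|Y_k(:,j)|_1 = \lfloor 1/\gamma \rfloor\big] = (1 - p)^{\lfloor 1/\gamma \rfloor}.$$

Thus there exists a column $j \in A$ with $|Y_k(:,j)|_1 = \lfloor 1/\gamma \rfloor$ (and hence $|Y_k(:,j)|_0 = 0$), with a probability not less than $1 - \big(1 - (1-p)^{\lfloor 1/\gamma \rfloor}\big)^{|A|} \to 1$. Thus we have w.h.p. $|Y_k(:, j_{\max})|_1 \ge \lfloor 1/\gamma \rfloor$. But due to Lemma 4 we have w.h.p. $|Y_k(:, j_{\max})| \le \lfloor 1/\gamma \rfloor$. Thus we have w.h.p.

$$|Y_k(:, j_{\max})|_1 = \lfloor 1/\gamma \rfloor, \quad \text{and} \quad |Y_k(:, j_{\max})|_0 = 0.$$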
E. Proof of Lemma 6

Conditioned on the event $E_{1,n}^c$, for a column $j \in \mathcal{H}$, $|Y_k(:,j)| \sim B(k, 1 - \epsilon)$. Suppose $A$ is the set of columns $j \in \mathcal{H}$ for which $|Y_k(:,j)| = 1/\gamma$. Then using similar steps as in the proof of Lemma 5 we obtain for a column $j$

$$\Pr[j \in A] = \Theta\left(n^{(-\gamma + g_n)/\gamma}\right) = \Theta\left(n^{-1 + g_n/\gamma}\right), \qquad (28)$$

and by linearity of expectation, we have

$$E[|A|] = \Theta\left(n^{g_n/\gamma}\right). \qquad (29)$$

Using Lemma 14 with $t = n^{g_n/\gamma} \log n$, we have w.h.p.

$$|A| = O\left(n^{g_n/\gamma} \log n\right). \qquad (30)$$

Now suppose $B$ denotes the set of columns $j \in \mathcal{H}$ for which $X(1, j) = 1$, and $|Y_k(:,j)| = 1/\gamma - 1$. Then by using similar steps as above, we obtain

$$\Pr[j \in B] = \Theta\left(n^{-1 + \gamma + g_n(1/\gamma - 1)}\right), \qquad (31)$$

and by linearity of expectation,

$$E[|B|] = \Theta\left(n^{\gamma + g_n(1/\gamma - 1)}\right). \qquad (32)$$

Thus using the Chernoff bound, we obtain w.h.p.

$$|B| = \Theta\left(n^{\gamma + g_n(1/\gamma - 1)}\right). \qquad (33)$$

For a column $j \in B$,

$$\Pr\big[|Y(:,j)|_1 = 1/\gamma - 1\big] = (1 - p)^{1/\gamma - 1}. \qquad (34)$$

Thus by defining

$$C := \{j : |Y(:,j)|_1 = 1/\gamma - 1,\ |Y(:,j)|_0 = 0\},$$

we see that for a column $j \in B$,

$$\Pr[j \in C] = (1 - p)^{1/\gamma - 1},$$

and by using linearity of expectation and (33),

$$E[|C|] \ge E[|C \cap B|] = \Omega\left(n^{\gamma + g_n(1/\gamma - 1)}\right), \qquad (35)$$

which together with the Chernoff bound implies that w.h.p.

$$|C| = \Omega\left(n^{\gamma + g_n(1/\gamma - 1)}\right). \qquad (36)$$

Thus, for the recommended column jmax, we have the following two possibilities. 
1) We have $|Y(:, j_{\max})|_1 = 1/\gamma$. Since w.h.p. $|Y(:, j_{\max})| \le 1/\gamma$ due to Lemma 4, we have w.h.p.

$$|Y(:, j_{\max})|_1 = 1/\gamma, \quad \text{and} \quad |Y(:, j_{\max})|_0 = 0.$$

2) We have $|Y(:, j_{\max})|_1 = 1/\gamma - 1$. Then either $j_{\max} \in A$, or $j_{\max} \in C$. From (30) and (36), since $g_n = o(1)$ and $\gamma > 0$, we see that w.h.p. $|A|$ is vanishingly small compared to $|C|$. Thus w.h.p. $j_{\max} \in C$. Thus, from the definition of $C$, we obtain

$$|Y(:, j_{\max})|_1 = 1/\gamma - 1, \quad \text{and} \quad |Y(:, j_{\max})|_0 = 0.$$

These two observations together prove the lemma.

F. Proof of optimality of T = k

Recall (2), which says that all the top $k$ neighbors picked by the PAF algorithm are good w.h.p. As before, let $E_{1,n}$ denote the event that a bad neighbor is picked amongst the top $k$ neighbors. For the remainder of this proof, we condition on the event $E_{1,n}^c$, i.e., all the top $k$ neighbors are good.

Throughout this proof, a column $j$ is good if $X(1, j) = 1$, and it is bad if $X(1, j) = 0$. Suppose $T_1 < k < T_2$, and $A_1^{(1)}$ denotes a set of $T_1$ good neighbors, $A_1^{(2)}$ denotes the rest of the $k - T_1$ rows of $A_1$, and $B_1$ is all the rows not in $A_1$ that are picked amongst the top $T_2$ rows. We see that $|B_1| = T_2 - k$. Recall that $\mathcal{H}$ denotes the set of columns $j$ such that $Y(1, j) = *$. Now suppose we do not get to observe $Y$; instead we get to observe the following random variables related to $Y$.

• For all the columns $j \in \mathcal{H}$, we observe the corresponding number of 1's restricted to $A_1^{(1)}$ and $A_1^{(2)}$. To be more precise, let $y_j^{(i)}$ denote the $j$-th column of $Y_{T_2}$ restricted to $A_1^{(i)}$. Then we observe $(|y_j^{(1)}|_1, |y_j^{(2)}|_1) = (s_j^{(1)}, s_j^{(2)})$ for all columns $j \in \mathcal{H}$. Let $\mathcal{I}_1$ denote this collection of observed random variables.

• For all the columns $j \in \mathcal{H}$, we also observe the corresponding number of 1's restricted to $B_1$. To be more precise, let $y_j^{(b)}$ denote the $j$-th column of $Y_{T_2}$ restricted to $B_1$ (the superscript $b$ is for bad). Then we observe $|y_j^{(b)}|_1 = s_j^{(b)}$ for all columns $j \in \mathcal{H}$. Let $\mathcal{I}_2$ denote this collection of observed random variables.

Upon observing $\mathcal{I}_1$ and $\mathcal{I}_2$, we want to find a column $j \in \mathcal{H}$ such that $X(1, j) = 1$. First we consider the MAP estimator for this problem, which selects a column $j_{MAP}$ satisfying

$$j_{MAP} = \arg\max_{j \in \mathcal{H}} \Pr[X(1, j) = 1 \mid \mathcal{I}_1, \mathcal{I}_2]. \qquad (37)$$

We again note that we get to observe only $\mathcal{I}_1$ and $\mathcal{I}_2$, and not $Y$. This MAP decoder makes an error with probability $P_e^{MAP} := \Pr[X(1, j_{MAP}) \ne 1]$. We now show that this probability of error is the same as the error probability of the PAF algorithm with $T = k$. Amongst the columns $j \in \mathcal{H}$, let $\mathcal{G}$ denote the set of good columns (with $X(1, j) = 1$) and $\mathcal{B}$ denote the set of bad columns (with $X(1, j) = 0$). With this notation, conditioned on $|\mathcal{G}| = m$, we now have



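$$\begin{aligned}
j_{MAP} &= \arg\max_{j \in \mathcal{H}} \Pr[X(1, j) = 1 \mid \mathcal{I}_1, \mathcal{I}_2] \\
&\overset{(a)}{=} \arg\max_{j \in \mathcal{H}} \Pr[X(1, j) = 1 \mid \mathcal{I}_1] \\
&= \arg\max_{j \in \mathcal{H}} \sum_{\{i_1, \ldots, i_{m-1}\} \subset \mathcal{H}} \Pr\big[\mathcal{G} = \{j, i_1, \ldots, i_{m-1}\} \mid \mathcal{I}_1\big] \\
&\overset{(b)}{=} \arg\max_{j \in \mathcal{H}} \sum_{\{i_1, \ldots, i_{m-1}\} \subset \mathcal{H}} \Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\big], \qquad (38)
\end{aligned}$$

where (a) follows since $X(1, j)$ is independent of $\mathcal{I}_2$, and (b) is true due to the Bayes' rule, since all the $m$-tuples are equiprobable candidates for $\mathcal{G}$, because of the i.i.d. nature of the columns of $X$. We observe that if $j \in \mathcal{G}$, then

$$s_j^{(1)} \sim B\big(T_1, (1-p)(1-\epsilon)\big), \quad \text{and} \quad s_j^{(2)} \sim B\big(k - T_1, (1-p)(1-\epsilon)\big),$$

and if $j \in \mathcal{B}$, then

$$s_j^{(1)} \sim B\big(T_1, p(1-\epsilon)\big), \quad \text{and} \quad s_j^{(2)} \sim B\big(k - T_1, p(1-\epsilon)\big).$$

It is also true that $\{(s_j^{(1)}, s_j^{(2)})\}_{j \in \mathcal{H}}$ are all independent of each other (conditioned on $X(1, \mathcal{H})$). Thus we have

$$\begin{aligned}
\Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\big] &\overset{(a)}{=} \Pr\Big[\big\{(|y_l^{(1)}|_1, |y_l^{(2)}|_1) = (s_l^{(1)}, s_l^{(2)})\big\}_{l \in \mathcal{H}} \,\Big|\, \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\Big] \\
&= \prod_{l \in \mathcal{H}} \Pr\Big[(|y_l^{(1)}|_1, |y_l^{(2)}|_1) = (s_l^{(1)}, s_l^{(2)}) \,\Big|\, \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\Big], \qquad (39)
\end{aligned}$$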



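where (a) follows due to the definition of $\mathcal{I}_1$. Thus, for $j, j' \in \mathcal{H}$ such that $j, j' \notin \{i_1, \ldots, i_{m-1}\}$, the factors of (39) for all columns other than $j$ and $j'$ cancel, and we have

$$\frac{\Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\big]}{\Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j', i_1, \ldots, i_{m-1}\}\big]} = \left(\frac{(1-p)\,\big(1 - p(1-\epsilon)\big)}{p\,\big(1 - (1-p)(1-\epsilon)\big)}\right)^{s_j^{(1)} + s_j^{(2)} - \left(s_{j'}^{(1)} + s_{j'}^{(2)}\right)}. \qquad (40)$$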



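Since $p < 1/2$, we now see that if $s_j^{(1)} + s_j^{(2)} > s_{j'}^{(1)} + s_{j'}^{(2)}$, then from (40),

$$\Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j, i_1, \ldots, i_{m-1}\}\big] > \Pr\big[\mathcal{I}_1 \mid \mathcal{G} = \{j', i_1, \ldots, i_{m-1}\}\big]. \qquad (41)$$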

From the above calculations, we also see that for e H such that — ■•■•*m-i}' have 

- 1. (42) 







•,«m-l}] 






*m-l}] 
Thus in (38), each term in the summation is maximized for the column $j$ with maximal $s_j^{(1)} + s_j^{(2)}$. Thus we have from (38),

$$j_{MAP} = \arg\max_{j : Y_k(1,j) = *} \left(s_j^{(1)} + s_j^{(2)}\right), \qquad (43)$$

and hence the MAP decoder is the same as choosing the column $j$ of $Y_k$ with the most number of 1's. Thus the probability of error for this MAP decoder is the same as the error probability of the PAF algorithm for $T = k$. To be precise, we have now shown that

$$P_e^{MAP} = P_e\big[\mathrm{PAF}(k) \mid E_{1,n}^c\big]. \qquad (44)$$

Instead of using the MAP decoder, if we use the decoder that chooses the column that maximizes $s_j^{(1)}$, then its error probability is the same as that of choosing the column $j$ of $Y_{T_1}$ with the most number of 1's. To be precise, suppose we use the following sub-optimal decoder that chooses

$$j_{\text{sub-optimal}}^{(1)} := \arg\max_{j : Y_k(1,j) = *} s_j^{(1)}.$$

Then its error probability is

$$P_e^{\text{sub-optimal},(1)} = P_e\big[\mathrm{PAF}(T_1) \mid E_{1,n}^c\big]. \qquad (45)$$

Similarly a different sub-optimal decoder that chooses

$$j_{\text{sub-optimal}}^{(2)} := \arg\max_{j : Y_k(1,j) = *} \left(s_j^{(1)} + s_j^{(2)} + s_j^{(b)}\right) \qquad (46)$$

has error probability

$$P_e^{\text{sub-optimal},(2)} = P_e\big[\mathrm{PAF}(T_2) \mid E_{1,n}^c\big].$$

Since MAP is a minimum error probability [24, p. 8] decoder, and since $T_1 < k < T_2$, (44), (45) and (46) together now imply that if $T \ne k$, then

$$P_e\big[\mathrm{PAF}(T) \mid E_{1,n}^c\big] \ge P_e\big[\mathrm{PAF}(k) \mid E_{1,n}^c\big]. \qquad (47)$$

By observing that $\Pr[E_{1,n}] = o(1)$ due to (2), we now obtain

$$P_e[\mathrm{PAF}(T)] \ge P_e[\mathrm{PAF}(k)] + o(1). \qquad (48)$$

Applying $\liminf_{n \to \infty}$ to the left hand side, and $\limsup_{n \to \infty}$ to the right hand side of (48) implies the lemma.
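The key step of this proof is that the likelihood ratio between two candidate good columns is a base larger than 1 raised to the difference of their counts of 1's, so the MAP rule picks the column with the most 1's. The Python sketch below illustrates this under the binomial model stated in the proof (good columns have 1's with probability $(1-p)(1-\epsilon)$, bad columns with probability $p(1-\epsilon)$); the function names are hypothetical.

```python
def likelihood_ratio_base(p: float, eps: float) -> float:
    """Base of the likelihood ratio between 'column j is good' and 'column j' is good'."""
    a = (1 - p) * (1 - eps)   # P(entry = 1) in a good column
    b = p * (1 - eps)         # P(entry = 1) in a bad column
    return (a * (1 - b)) / (b * (1 - a))

def likelihood_ratio(p: float, eps: float, ones_j: int, ones_j_prime: int) -> float:
    """Ratio of the two likelihoods, given the number of 1's observed in each column."""
    return likelihood_ratio_base(p, eps) ** (ones_j - ones_j_prime)

if __name__ == "__main__":
    p, eps = 0.2, 0.9   # hypothetical values with p < 1/2
    print(likelihood_ratio_base(p, eps))   # exceeds 1 whenever p < 1/2
    print(likelihood_ratio(p, eps, 3, 1))  # column with more 1's is more likely good
```

The base exceeds 1 exactly when $a > b$, that is, when $p < 1/2$, which is the condition used in the proof.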



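G. Proof of Lemma 7

To prove this lemma we show that $\forall j = 2, 3, \ldots, n$, $s_{1j}$ is dominated by a binomial random variable $\bar{s}_{1j} \sim B(n, (1-\epsilon)^2)$. The lemma will follow by upper bounding the upper tail of $\bar{s}_{1j}$.
We first define another quantity to measure the overlap between two rows, namely the number of columns in which both rows are unerased:

$$\bar{s}_{ij} := \sum_{k=1}^{n} \mathbf{1}\{Y(i,k) \ne *\}\, \mathbf{1}\{Y(j,k) \ne *\}. \qquad (49)$$

From (1) we see that $\bar{s}_{1j} \ge s_{1j}$ for all $j$. We first upper bound the upper tail of $\bar{s}_{1j}$. We see that $\forall j = 2, 3, \ldots, n$ we have $\bar{s}_{1j} \sim B(n, (1-\epsilon)^2)$. Hence

$$\begin{aligned}
\Pr[\bar{s}_{1j} \ge t] &= \sum_{s=t}^{n} \binom{n}{s} (1-\epsilon)^{2s} \big(1 - (1-\epsilon)^2\big)^{n-s} \\
&\overset{(a)}{\le} \sum_{s=t}^{\infty} \left(c^2 n^{-2\beta}\right)^s = \frac{c^{2t} n^{-2\beta t}}{1 - c^2 n^{-2\beta}} \\
&\overset{(b)}{\le} 2 c^{2t} n^{-2\beta t}, \text{ for large } n,
\end{aligned}$$

where (a) follows since $\binom{n}{s} \le n^s$, $1 - \epsilon = c/n^{\alpha}$ (so that $n (1-\epsilon)^2 = c^2 n^{-(2\alpha - 1)} = c^2 n^{-2\beta}$) and $1 - (1-\epsilon)^2 < 1$, and (b) is true because $1 - c^2 n^{-2\beta} \to 1$ with $n$. Thus the probability that the overlap is at least $t$ for some row $j = 2, 3, \ldots, n$ is (by the union bound)

$$\Pr[\exists j,\ \bar{s}_{1j} \ge t] \le 2(n-1) c^{2t} n^{-2\beta t} \to 0,$$

if $2\beta t > 1$, i.e., if $t > \frac{1}{2\beta}$. Defining $t_{\max} := \lfloor \frac{1}{2\beta} \rfloor$ proves that w.h.p. for all $j = 2, 3, \ldots, n$, $\bar{s}_{1j} \le t_{\max}$. Since $\bar{s}_{1j} \ge s_{1j}$, we now have that w.h.p. for all $j = 2, 3, \ldots, n$, $s_{1j} \le t_{\max}$.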
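As a numerical illustration of this argument, the Python sketch below evaluates the overlap threshold and the union bound. It is parametrized by $\alpha$ and assumes the exponent of the bound is $(2\alpha - 1)t$, which follows from $1 - \epsilon = c/n^{\alpha}$; the function names and example values are hypothetical.

```python
import math

def t_max(alpha: float) -> int:
    """Largest overlap not ruled out by the union bound: floor(1 / (2*alpha - 1)).

    Requires alpha > 1/2. When 1/(2*alpha - 1) is an integer, floating-point
    rounding can matter, so exact arithmetic may be preferable in that case.
    """
    if alpha <= 0.5:
        raise ValueError("alpha must exceed 1/2")
    return math.floor(1.0 / (2.0 * alpha - 1.0))

def overlap_union_bound(n: int, alpha: float, c: float, t: int) -> float:
    """Union bound 2 (n-1) c^(2t) n^(-(2*alpha-1) t) on Pr[some row overlaps row 1 in >= t entries]."""
    return 2.0 * (n - 1) * c ** (2 * t) * n ** (-(2.0 * alpha - 1.0) * t)

if __name__ == "__main__":
    n, alpha, c = 10**6, 0.75, 1.0       # hypothetical example values
    t = t_max(alpha) + 1                  # first overlap level that the bound excludes
    print(t_max(alpha), overlap_union_bound(n, alpha, c, t))
```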

H. Proof of Lemma 8

To prove this lemma, we see that for $k \ge c_4 n^{m(2\alpha - 1)} \log r$, the $N_j(m)$ are mixtures of Binomials with "high" mean, which leads to "strong" concentration around the mean. This implies that $\{N_j(m)\}_{j=1}^{r}$ are within constant factors of each other. But when $k < c_4 n^{m(2\alpha - 1)} \log r$, we do not have "strong" concentration in general due to the low mean; we can, however, suitably upper bound $N_1(m)$ and find a lower bound for the other $N_j(m)$'s. Finally, when $k$ becomes much smaller ($= o(n^{m(2\alpha - 1)})$), $N_1(m) = 0$ w.h.p. Below, we see this in detail.

1) First we study the order of $N_{good}(m)$. Let $L$ be the number of unerased entries of row 1. We see that $L \sim B(n, 1 - \epsilon)$, hence $E[L] = c n^{1-\alpha}$ and w.h.p. $L = \Theta(n^{1-\alpha})$, due to the Chernoff bound. Conditioned on the erasure sequence of row 1, we see that $\forall j \in A_1$, $s_{1j} \sim B(L, 1 - \epsilon)$. Let, $\forall j \in A_1$,

$$p_l(m) := \Pr[s_{1j} = m \mid L = l] = \binom{l}{m} (1-\epsilon)^m \epsilon^{l-m}. \qquad (50)$$

Conditioned on $L = l$, every row $j \in A_1$ contributes to $N_{good}(m)$ independently with probability $p_l(m)$, implying $N_{good}(m) \sim B(k, p_l(m))$. Now for $l = \Theta(n^{1-\alpha})$,

$$\begin{aligned}
E[N_{good}(m) \mid L = l] &= k\, p_l(m) = k \binom{l}{m} (1-\epsilon)^m \epsilon^{l-m} \\
&\overset{(a)}{=} \Theta\left(\frac{k\, l^m}{n^{\alpha m}}\right) = \Theta\left(\frac{k}{n^{m(2\alpha - 1)}}\right) = \Theta\left(\frac{k}{n^{2\beta m}}\right), \qquad (51)
\end{aligned}$$

where (a) follows since for a constant $m$ we have $\binom{l}{m} = \Theta(l^m)$, $1 - \epsilon = c/n^{\alpha}$, and $\epsilon^{l-m} \to 1$ for $l = \Theta(n^{1-\alpha})$. Since $k \ge c_4 n^{2\beta m} \log r$, conditioned on $L = l = \Theta(n^{1-\alpha})$, applying the Chernoff bound [23, Theorem 1.1] on $N_{good}(m)$ we see that for any $\delta \in (0, 1)$, w.h.p.

$$E[N_{good}(m) \mid L = l](1 - \delta) \le N_{good}(m) \le E[N_{good}(m) \mid L = l](1 + \delta). \qquad (52)$$

We have already seen that w.h.p. $L = \Theta(n^{1-\alpha})$. Thus (51) and (52) now imply that w.h.p. $N_{good}(m) = \Theta\left(\frac{k}{n^{2\beta m}}\right)$.
Now we obtain a similar order bound for each of $N_j(m)$. Recall the definition of $C_j$: it denotes the number of common column clusters of $X$ between row cluster 1 and $j$. Then we see that for $j = 2, \ldots, r$, $C_j \sim B(r, 1/2)$, and thus the Chernoff bound [23, Theorem 1.1] implies that $\forall j = 2, \ldots, r$, w.h.p. $C_j = \Theta(r)$ as long as $r$ increases to infinity with $n$. For a row cluster $A_j$, let $Q_j$ be the number of unerased entries of row 1, restricted to these $C_j$ common column clusters. Conditioned on the value of $C_j = c_j$, we see that $Q_j \sim B(c_j k, 1 - \epsilon)$, implying $E[Q_j \mid C_j = c_j] = c_j k (1 - \epsilon)$. Since $n = kr$, we have for $c_j = \Theta(r)$,

$$E[Q_j \mid C_j = c_j] = \Theta(n)(1 - \epsilon) = \Theta(n^{1-\alpha}),$$

hence using the Chernoff bound we see that for $\delta \in (0, 1)$, conditioned on $C_j = c_j$, w.h.p.

$$E[Q_j \mid C_j = c_j](1 - \delta) \le Q_j \le E[Q_j \mid C_j = c_j](1 + \delta). \qquad (53)$$

Since w.h.p. $C_j = \Theta(r)$, (53) now implies that w.h.p. $Q_j = \Theta(n^{1-\alpha})$.

Let $s_i$ denote the number of commonly sampled entries of row 1 and row $i \in A_j$ within these $C_j$ common column clusters. Then conditioned on $Q_j = q$, we see that $s_i \sim B(q, 1 - \epsilon)$. Thus

$$\Pr[s_i = m \mid Q_j = q] = \binom{q}{m} (1-\epsilon)^m \epsilon^{q-m} = p_q(m),$$

where $p_q(m)$ is as defined in (50). We see that conditioned on $Q_j = q$, each row $i \in A_j$ overlaps with row 1 at $m$ entries independently with probability $p_q(m)$, i.e., $N_j(m) \sim B(k, p_q(m))$ and thus, for $q = \Theta(n^{1-\alpha})$, we have

$$\begin{aligned}
E[N_j(m) \mid Q_j = q] &= k\, p_q(m) = k \binom{q}{m} (1-\epsilon)^m \epsilon^{q-m} \\
&\overset{(b)}{=} \Theta\left(\frac{k\, q^m}{n^{\alpha m}}\right) = \Theta\left(\frac{k}{n^{2\beta m}}\right), \qquad (54)
\end{aligned}$$

where (b) follows since for a constant $m$ we have $\binom{q}{m} = \Theta(q^m)$, $1 - \epsilon = c/n^{\alpha}$, and $\epsilon^{q-m} \to 1$ for $q = \Theta(n^{1-\alpha})$. Since $k \ge c_4 n^{2\beta m} \log r$ for a large enough constant $c_4$, conditioned on $Q_j = q_j = \Theta(n^{1-\alpha})$, the Chernoff bound applied to $N_j(m)$ along with a union bound gives that w.h.p. $\forall j = 2, 3, \ldots, r$ we have

$$E[N_j(m) \mid Q_j = q_j](1 - \delta) \le N_j(m) \le E[N_j(m) \mid Q_j = q_j](1 + \delta). \qquad (55)$$

As we have already seen that $Q_j = \Theta(n^{1-\alpha})$, (54) and (55) now imply that w.h.p.

$$\{N_j(m)\}_{j=2}^{r} = \Theta\left(\frac{k}{n^{2\beta m}}\right).$$

This along with the previous observation that w.h.p. $N_{good}(m) = \Theta\left(\frac{k}{n^{2\beta m}}\right)$ proves the first part of the lemma.

2) Before we start proving the second part of the lemma, we need a bound on the upper binomial tail with small mean.

Lemma 14 (Tail of a binomial [23, p. 23]): Suppose $X \sim B(n, p)$ such that $E[X] = np$. For $t > 2e E[X]$, we have

$$\Pr[X > t] \le 2^{-t}.$$

In the proof of the first part, we have seen that conditioned on $L = l$, we have $N_{good}(m) \sim B(k, p_l(m))$, and for $l = \Theta(n^{1-\alpha})$ we have

$$E[N_{good}(m) \mid L = l] = \Theta\left(\frac{k}{n^{2\beta m}}\right) = O(\log r),$$

where the last equality follows since $k < c_4 n^{2\beta m} \log r$. Thus, for a large enough constant $c'$ and for $t > c' \log r$, we have

$$\Pr[N_{good}(m) > t] \le 2^{-t},$$

which implies that w.h.p. $N_{good}(m) = O(\log r)$. We have also seen in the proof of the first part that for $j = 2, 3, \ldots, r$, conditioned on $Q_j = q$, $N_j(m) \sim B(k, p_q(m))$ and for $q = \Theta(n^{1-\alpha})$ we have

$$E[N_j(m) \mid Q_j = q] = k\, p_q(m) = \Theta\left(\frac{k}{n^{2\beta m}}\right) = O(\log r),$$

where the last equality follows since $k < c_4 n^{2\beta m} \log r$. Thus Lemma 14 together with a union bound implies that w.h.p.

$$\{N_j(m)\}_{j=2}^{r} = O(\log r).$$

Now we want to lower bound $N_j(m)$ for $j > 1$. Recall that for $j = 2, 3, \ldots, r$, conditioned on $Q_j = q$, $N_j(m) \sim B(k, p_q(m))$ and for $q = \Theta(n^{1-\alpha})$ we have

$$E[N_j(m) \mid Q_j = q] = k\, p_q(m) = \Theta\left(\frac{k}{n^{2\beta m}}\right) \ge c_6, \qquad (56)$$

for a constant $c_6 > 0$, where the last inequality is true because $k \ge c_5 n^{2\beta m}$. Thus, for $q = \Theta(n^{1-\alpha})$ we have

$$p_0^{(q)} := \Pr[N_j(m) = 0 \mid Q_j = q] \overset{(a)}{=} (1 - p_q(m))^k \overset{(b)}{\le} e^{-c_6} < 1,$$

where (a) is true since conditioned on $Q_j = q$, $N_j(m) \sim B(k, p_q(m))$, and (b) follows due to (56). Now we observe that conditioned on the values of $\{Q_2, Q_3, \ldots, Q_r\}$, $\{N_j(m)\}_{j=2}^{r}$ are independent random variables. Let $S$ denote the set of row clusters $j \in \{2, 3, \ldots, r\}$ such that $N_j(m) \ge 1$. Conditioned on the values $Q_j = q_j$ for $j = 2, \ldots, r$, we see that

$$|S| = \sum_{j=2}^{r} \mathbf{1}_j,$$

where $\{\mathbf{1}_j\}_{j=2}^{r}$ are independent binary random variables with $\Pr[\mathbf{1}_j = 0] = p_0^{(q_j)}$. In the first part of the proof, we have seen that w.h.p. for all $j = 2, \ldots, r$, $Q_j = \Theta(n^{1-\alpha})$. This along with a Chernoff bound on $|S|$ implies that for any $\delta \in (0, 1)$, w.h.p.

$$|S| \ge (r - 1)(1 - e^{-c_6})(1 - \delta).$$

In other words, there exists a subset $S$ of $\{2, 3, \ldots, r\}$ such that $|S| = \Theta(r)$ and for all $j \in S$, $N_j(m) \ge 1$.

3) We have already seen in the first part of the proof that w.h.p. $E[N_{good}(m)] = \Theta\left(\frac{k}{n^{2\beta m}}\right)$. Since $k = o(n^{2\beta m})$, we have $E[N_{good}(m)] \to 0$. Thus

$$\Pr[N_{good}(m) > 0] \to 0,$$

since for a non-negative integer valued random variable $X$,

$$E[X] = \sum_{i=0}^{\infty} \Pr[X > i] \ge \Pr[X > 0].$$
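The tail-sum identity used here, and the resulting bound $\Pr[X > 0] \le E[X]$, can be verified directly for a binomial random variable. The following Python sketch is only an illustration; the function names and parameter values are hypothetical.

```python
import math

def binomial_pmf(n: int, p: float) -> list:
    """Probability mass function of B(n, p) as a list indexed by the outcome."""
    return [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def mean_via_tail_sum(n: int, p: float) -> float:
    """E[X] computed as sum_{i >= 0} Pr[X > i]; terms with i >= n are zero."""
    pmf = binomial_pmf(n, p)
    return sum(sum(pmf[i + 1:]) for i in range(n))

def prob_positive(n: int, p: float) -> float:
    """Pr[X > 0] = 1 - Pr[X = 0] for X ~ B(n, p)."""
    return 1.0 - (1 - p)**n

if __name__ == "__main__":
    print(mean_via_tail_sum(20, 0.3))               # should be close to n*p = 6
    print(prob_positive(20, 0.001), 20 * 0.001)     # Pr[X > 0] is at most E[X]
```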

I. Proof of Corollary 1

1) The first part follows by observing that for $k \ge c_4 n^{m(2\alpha - 1)} \log r$, w.h.p.

$$N_j(m^+) = \sum_{t=m}^{t_{\max}} N_j(t) \overset{(a)}{=} \sum_{t=m}^{t_{\max}} \Theta\left(\frac{k}{n^{t(2\alpha - 1)}}\right) = \Theta\left(\frac{k}{n^{m(2\alpha - 1)}}\right),$$

where (a) follows from the first part of Lemma 8.

2) For the second part, suppose $c_5 n^{m(2\alpha - 1)} \le k < c_4 n^{m(2\alpha - 1)} \log r$. Then for $t \ge m + 1$ we have $k = o(n^{t(2\alpha - 1)})$. Thus w.h.p.

$$N_j(m^+) = N_j(m) + \sum_{t=m+1}^{t_{\max}} N_j(t) \overset{(b)}{=} O(\log r) + \sum_{t=m+1}^{t_{\max}} o(1) = O(\log r),$$

where (b) is due to the second part of Lemma 8. The fact that there exists a subset $S$ of $\{2, 3, \ldots, r\}$ with $|S| = \Theta(r)$ such that for $j \in S$ we have $N_j(m^+) \ge 1$ follows immediately from the second part of Lemma 8.

3) For $k = o(n^{m(2\alpha - 1)})$, we have w.h.p.

$$N_{good}(m^+) = \sum_{t=m}^{t_{\max}} N_{good}(t) \overset{(c)}{=} 0,$$

where (c) follows from the third part of Lemma 8.
J. Proof of Lemma 9

We prove this lemma by using similar steps as used in proving Lemma 8; the main difference is that we need a tail bound for hyper-geometric random variables, unlike the Chernoff bound for i.i.d. random variables used in proving Lemma 8.

To begin with, we observe from (15) and Lemma 8 that
Obs.1) If $k \ge c_4 n^{m_0(2\alpha - 1)} \log r$, then we have w.h.p. $\{N_j(m_0)\}_{j=2}^{r} = \Theta(N_1(m_0))$, implying $\{E[\xi_j]\}_{j=2}^{r} = \Theta(E[\xi_1])$.

Obs.2) If there is a positive constant $c_5 > 0$ such that $c_5 n^{m_0(2\alpha - 1)} \le k < c_4 n^{m_0(2\alpha - 1)} \log r$, then w.h.p. $E[\xi_1] = O(\log r)$.

Obs.3) If $k = o(n^{m_0(2\alpha - 1)})$, then w.h.p. $E[\xi_1] = 0$.

We break down the proof into various cases for different values of $T$ and $k$. As in (13), suppose $m_0$ is a positive integer such that

$$N((m_0 + 1)^+) < T \le N(m_0^+),$$

and $c_4 > 0$ is a large positive constant (the same as the constant $c_4$ defined in Lemma 8).
Case 1 ($k \ge c_4 n^{(m_0+1)(2\alpha - 1)} \log r$): Corollary 1 implies that w.h.p.

$$\{N_j((m_0 + 1)^+)\}_{j=1}^{r} = \Theta\left(\frac{k}{n^{(m_0+1)(2\alpha - 1)}}\right) = \Omega(\log r), \qquad (57)$$
and Theorem |8] implies that w.h.p. 

Also recall the definition of the hyper-geometric random variable $\xi_j$ from (14). Suppose $c_6$ is a large enough positive constant. We consider two possible cases.

1) Suppose $\min_j E[\xi_j] \ge c_6 \log r$. Since $\{\xi_j\}$ are hyper-geometric random variables, from the hyper-geometric tail bound (Corollary 3, Appendix N) used together with a union bound, it follows that w.h.p.

$$\{\xi_j\}_{j=1}^{r} = \Theta(E[\xi_1]) = \Omega(\log r),$$

and this together with (14) and (57) implies that w.h.p. $\{T_j\}_{j=2}^{r} = \Theta(T_1)$. Thus there exists a positive integer $d$ such that w.h.p. for $j = 2, 3, \ldots, r$, we have $T_j \ge d T_1$ for large enough $n$. This implies ($C_1$).

2) Now suppose $\min_j E[\xi_j] < c_6 \log r$. From Obs.1), we see that for different values of $j$, the $E[\xi_j]$ are within a constant factor of each other. Thus we have $E[\xi_1] = O(\log r)$. Then Corollary 4 (see Appendix N) implies that w.h.p.

$$\xi_1 = O(\log r).$$

This together with (14) and (57) implies that w.h.p. $\{T_j\}_{j=2}^{r} = \Omega(T_1)$. This implies ($C_2$).

Case 2 ($\exists c_5 > 0$, $c_5 n^{(m_0+1)(2\alpha - 1)} \le k < c_4 n^{(m_0+1)(2\alpha - 1)} \log r$): Corollary 1 implies that w.h.p.

$$N_1((m_0 + 1)^+) = O(\log r),$$

and there is a subset $S$ of $\{2, 3, \ldots, r\}$ with $|S| = \Theta(r)$ such that for $j \in S$,

$$N_j((m_0 + 1)^+) \ge 1.$$

We now see from Obs.1) that $E[\xi_1] = O(\log r)$, which together with Corollary 4 in Appendix N implies that w.h.p.

$$\xi_1 = O(\log r).$$

Thus (14) now implies that w.h.p. $T_1 = O(\log r)$, and there is a subset $S$ of $\{2, 3, \ldots, r\}$ with $|S| = \Theta(r)$ such that for $j \in S$, $T_j \ge 1$. This implies ($C_2$).

Case 3 ($c_4 n^{m_0(2\alpha - 1)} \log r \le k = o(n^{(m_0+1)(2\alpha - 1)})$): In this regime, Corollary 1 implies that w.h.p.

$$N_1((m_0 + 1)^+) = 0.$$

Depending on the value of $\min_j E[\xi_j]$, we now consider three possible cases. Suppose $c_6$ is a large enough positive constant.

1) Suppose $\min_j E[\xi_j] \ge c_6 \log r$. Then from the hyper-geometric tail bound (Corollary 3, Appendix N) used together with a union bound, it follows that w.h.p.

$$\{\xi_j\}_{j=1}^{r} = \Theta(E[\xi_1]) = \Omega(\log r),$$

and this together with (14) implies that w.h.p. $\{T_j\}_{j=2}^{r} = \Theta(T_1)$. In other words, there exists a positive integer $d$ such that w.h.p. for $j = 2, 3, \ldots, r$, we have $T_j \ge d T_1$ for large enough $n$, implying ($C_1$).

2) Now suppose $\exists c_5 > 0$ such that $c_5 \le \min_j E[\xi_j] < c_6 \log r$. Using Obs.1), this actually implies $\{E[\xi_j]\}_{j=1}^{r} = O(\log r)$. Then the hyper-geometric tail bound (Corollary 4, Appendix N), together with a union bound, implies that w.h.p.

$$\{\xi_j\}_{j=1}^{r} = O(\log r). \qquad (58)$$

Suppose $S' := \{i \in [r] : \xi_i \ge 1\}$. Since we have seen in (16) that $\sum_{j=1}^{r} \xi_j = T - 1 - N((m_0 + 1)^+)$, (58) implies that w.h.p.

$$|S'| = \Omega\left(\frac{T - 1 - N((m_0 + 1)^+)}{\log r}\right). \qquad (59)$$

Since $k \ge c_4 n^{m_0(2\alpha - 1)} \log r$, using Lemma 8 we see that $\{N_j(m_0)\}_{j=2}^{r} = \Theta(N_1(m_0))$, implying

$$\{N_j(m_0)\}_{j=1}^{r} = \Theta(N(m_0)/r).$$

This observation together with (15) implies that

$$E[\xi_1] = \frac{N_1(m_0)}{N(m_0)} \big(T - 1 - N((m_0 + 1)^+)\big) = \Theta(1/r)\big(T - 1 - N((m_0 + 1)^+)\big). \qquad (60)$$

Since $E[\xi_1] \ge c_5$, (60) now implies that

$$T - 1 - N((m_0 + 1)^+) = \Omega(r).$$

Thus from (59) we see that w.h.p.

$$|S'| = \Omega(r / \log r).$$

Now using (14) we see that w.h.p. $T_1 = O(\log r)$, and for $j \in S'$, $T_j \ge 1$, implying ($C_2$).

3) If $\min_j E[\xi_j] \to 0$, then using Obs.1), we see that $E[\xi_1] \to 0$. This implies that w.h.p. $\xi_1 = 0$, since for a non-negative integer valued random variable $X$, $E[X] = \sum_{i=0}^{\infty} \Pr[X > i]$. As we have already observed that w.h.p. $N_1((m_0 + 1)^+) = 0$, (14) now implies that w.h.p. $T_1 = 0$. This implies ($C_3$).
Case 4 ($k < c_4 n^{m_0(2\alpha - 1)} \log r$): In this regime, Corollary 1 implies that w.h.p.

$$N_1((m_0 + 1)^+) = 0,$$

since $k = o\left(n^{(m_0+1)(2\alpha - 1)}\right)$. If $k = o(n^{m_0(2\alpha - 1)})$, then Lemma 8 implies that w.h.p.

$$N_1(m_0) = 0,$$

implying w.h.p. $\xi_1 = 0$, which together with (14) implies that w.h.p. $T_1 = 0$, and hence ($C_3$). Thus we now assume that there is a constant $c_5 > 0$ such that $k \ge c_5 n^{m_0(2\alpha - 1)}$. Using Lemma 8 we see that

$$\{N_j(m_0)\}_{j=1}^{r} = O(\log r),$$

implying

$$\{E[\xi_j]\}_{j=1}^{r} = O(\log r), \qquad (61)$$

due to (15). Depending on the value of $E[\xi_1]$, we now consider two possible cases. Suppose $c_6$ is a large enough positive constant.

1) Suppose $\exists c_5 > 0$ such that $c_5 \le E[\xi_1] < c_6 \log r$. Using (61), the hyper-geometric tail bound (Corollary 4, Appendix N) together with a union bound implies that w.h.p.

$$\xi_1 = O(\log r). \qquad (62)$$

As in the second part of Case 3, suppose $S' := \{i \in [r] : \xi_i \ge 1\}$. Since we have seen in (16) that $\sum_{j=1}^{r} \xi_j = T - 1 - N((m_0 + 1)^+)$, (62) implies that w.h.p.

$$|S'| = \Omega\left(\frac{T - 1 - N((m_0 + 1)^+)}{\log r}\right). \qquad (63)$$

Since

$$c_5 n^{m_0(2\alpha - 1)} \le k < c_4 n^{m_0(2\alpha - 1)} \log r,$$

using Lemma 8 we see that $N_1(m_0) = O(\log r)$ and there is a subset $S$ of $\{2, 3, \ldots, r\}$ with $|S| = \Theta(r)$, such that for $j \in S$, $N_j(m_0) \ge 1$. Thus we have $N(m_0) = \Omega(r)$. Using this observation together with (15) we see that

$$E[\xi_1] = O(\log r / r)\big(T - 1 - N((m_0 + 1)^+)\big). \qquad (64)$$

Since $E[\xi_1] \ge c_5$, (64) now implies that

$$T - 1 - N((m_0 + 1)^+) = \Omega(r / \log r).$$

Thus from (63) we see that w.h.p.

$$|S'| = \Omega(r / \log^2 r).$$

Now using (14) we see that w.h.p. $T_1 = O(\log r)$, and for $j \in S'$, $T_j \ge 1$. This implies ($C_2$).

2) Now suppose $E[\xi_1] \to 0$. This implies that w.h.p. $\xi_1 = 0$, since for a non-negative integer valued random variable $X$, $E[X] = \sum_{i=0}^{\infty} \Pr[X > i]$. As we have already observed that w.h.p. $N_1((m_0 + 1)^+) = 0$, (14) now implies that w.h.p. $T_1 = 0$. This implies ($C_3$).



K. Proof of Lemma 10

To prove this lemma, we consider a "new" estimation problem and two different estimators. The first is a maximum a posteriori probability (MAP) estimator whose probability of error equals the right hand side (RHS) of Lemma 10, whereas the second is a sub-optimal estimator whose probability of error equals the left hand side (LHS) of Lemma 10. Since the MAP estimator minimizes the probability of error over all estimators [24], this would prove the lemma.

By increasing the number of good neighbors from $T_1$ to $u_n + 1$ (recall that $u_n$ is the smallest multiple of $d$ not less than $T_1$), we increase the number of 1's in the columns $j$ with $X(1, j) = 1$, and do not change the number of 1's in the columns $j$ with $X(1, j) = 0$. Thus this reduces the probability of error for majority decoding on $Y_T$. Thus to prove the lower bound on $P_e^{maj}[Y_T]$, we assume without loss of generality that $T_1 = u_n$.

For every row cluster $A_i$, $i \ge 2$, of $Y_T$, suppose $A_i^{(1)}$ represents the first $l_n$ rows in that cluster, and $A_i^{(2)}$ represents the rest of the $T_i - l_n$ rows. Consider the following estimation problem, where we do not get to observe $Y_T$. Instead we observe the following two random variables.

• For all columns $j$ such that $Y^{(e)}(1, j) = *$, we observe the corresponding column sums, i.e., we observe $|Y^{(e)}(:, j)|_1 = t_j$. Let $\mathcal{I}_1$ denote the collection of these observed random variables.

• We also observe the column sums of $Y_T$ restricted to the second part of the row clusters. To make this precise, let $y_j$ denote the $j$th column of $Y_T$, restricted to $\cup_{i=2}^{r} A_i^{(2)}$. Then we observe $s_j := |y_j|_1$. Let $\mathcal{I}_2$ denote the collection of these observed random variables.
Upon observing $\mathcal{I}_1$ and $\mathcal{I}_2$, we want to find a column $j$ such that $X^{(e)}(1, j) = 1$. First we consider the MAP estimator for this problem, which selects a column $j_{MAP}$ satisfying

$$\begin{aligned}
j_{MAP} &= \arg\max_{j : Y^{(e)}(1,j) = *} \Pr[X^{(e)}(1, j) = 1 \mid \mathcal{I}_1, \mathcal{I}_2] \\
&\overset{(a)}{=} \arg\max_{j : Y^{(e)}(1,j) = *} \Pr[X^{(e)}(1, j) = 1 \mid \mathcal{I}_1] \\
&\overset{(b)}{=} \arg\max_{j : Y^{(e)}(1,j) = *} \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = t_j\big], \qquad (65)
\end{aligned}$$

where (a) follows since $X^{(e)}(1, j)$ is independent of $\mathcal{I}_2$, as $\mathcal{I}_2$ contains information only about the bad row clusters, and (b) is true because

$$X^{(e)}(1, j) \to |Y^{(e)}(:, j)|_1 \to \mathcal{I}_1.$$

We now state a lemma that will help in simplifying the above expression for $j_{MAP}$.
Lemma 15: $\Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = t_j\big]$ is an increasing function of $t_j$.
Proof: First we observe that $|Y^{(e)}(:, j)|_1$ is a multiple of $l_n$, where $l_n$ is the size of the bad row clusters of $Y^{(e)}$. Thus $t_j = m_j l_n$ for some positive integer $m_j$. We also see that $u_n / l_n = d$ (where $d$ is as in Lemma 9). Let

$$p_1 := \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = t_j\big],$$

and $p_0 := 1 - p_1$. Then

$$\begin{aligned}
\frac{p_0}{p_1} &= \frac{\Pr\big[X^{(e)}(1, j) = 0 \,\big|\, |Y^{(e)}(:, j)|_1 = t_j\big]}{\Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = t_j\big]} \\
&\overset{(a)}{=} \frac{\Pr\big[|Y^{(e)}(:, j)|_1 = t_j \,\big|\, X^{(e)}(1, j) = 0\big]}{\Pr\big[|Y^{(e)}(:, j)|_1 = t_j \,\big|\, X^{(e)}(1, j) = 1\big]} \\
&\overset{(b)}{=} \frac{\binom{r-1}{m_j}}{\binom{r-1}{m_j - d}} = \frac{(m_j - d)!\,(r - 1 - m_j + d)!}{m_j!\,(r - 1 - m_j)!} \\
&= \frac{(r - m_j)(r - m_j + 1) \cdots (r - 1 - m_j + d)}{(m_j - d + 1)(m_j - d + 2) \cdots m_j},
\end{aligned}$$

where (a) is due to the Bayes' expansion and the observation that $\Pr[X^{(e)}(1, j) = 1] = 1/2$, and (b) is true since for $j$ such that $Y^{(e)}(1, j) = *$, $|Y^{(e)}(:, j)|_1 \sim l_n \times B(r, 1/2)$. Thus $p_0/p_1$ is clearly a decreasing function of $m_j$. But

$$\frac{p_0}{p_1} = \frac{1}{p_1} - 1,$$

implying that $p_1$ is an increasing function of $m_j$. Since $t_j = m_j l_n$, we now see that $p_1$ is an increasing function of $t_j$. ∎
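The monotonicity claimed in Lemma 15 can be checked with exact rational arithmetic. The Python sketch below evaluates the posterior odds $p_0/p_1 = \binom{r-1}{m}/\binom{r-1}{m-d}$ from the proof and the resulting posterior $p_1$; the function names and the small values of $r$ and $d$ are hypothetical.

```python
from fractions import Fraction
from math import comb

def posterior_odds(r: int, m: int, d: int) -> Fraction:
    """Odds p0/p1 = C(r-1, m) / C(r-1, m-d), defined for d <= m <= r-1."""
    return Fraction(comb(r - 1, m), comb(r - 1, m - d))

def posterior_good(r: int, m: int, d: int) -> Fraction:
    """Posterior p1 that the column is good, recovered from p0/p1 = 1/p1 - 1."""
    return 1 / (1 + posterior_odds(r, m, d))

if __name__ == "__main__":
    r, d = 21, 2   # hypothetical: 21 row clusters, good cluster worth d bad clusters
    for m in range(d, r):
        print(m, float(posterior_odds(r, m, d)), float(posterior_good(r, m, d)))
```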



July 18, 2011 



DRAFT 






Using this lemma, (65) now becomes

$$j_{MAP} = \arg\max_{j : Y^{(e)}(1,j) = *} |Y^{(e)}(:, j)|_1 = j_{maj}(Y^{(e)}).$$

In other words, majority decoding is the same as the MAP estimator for the above estimation problem. Now we consider a different (sub-optimal) estimator for the same problem. Suppose

jmaj :=arg max B{\Y'^"\:, - e) + Sj, (66) 

i:Y(=)(lj)=* 

where we recall that Sj denotes the number of I's in the jth column of Y^, restricted to U[^2^i ■ Let the 
corresponding probability of error be Pe[X]. We observe that the probability law of B{\Y^'^\:, + Sj is same as 
that of|YT(:, Thus we have 

=P™''[Yt], 

and this together with the fact that MAP estimator minimizes the probability of error 1,24. p. 8], implies that 
L. Proof of Lemma 11

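Conditioned on ($C_1$): We first condition on the event ($C_1$). In this case, $\mathbf{a}_e$ is an $r$-length vector with $\mathbf{a}_e(1) = u_n + 1 = d \lceil T_1/d \rceil + 1$, and for $j = 2, 3, \ldots, r$, $\mathbf{a}_e(j) = \lceil T_1/d \rceil =: h$. Let

$$p_1 := \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, j_{maj}(Y^{(e)}) = j\big],$$

and

$$p_0 := 1 - p_1 = \Pr\big[X^{(e)}(1, j) = 0 \,\big|\, j_{maj}(Y^{(e)}) = j\big].$$

By computing the ratio $p_1/p_0$ using the Bayes' rule, it can be shown that the majority estimator is not worse than a random estimator, i.e., $p_1 \ge 1/2$ and $p_0 \le 1/2$. For a column $j$ such that $Y^{(e)}(1, j) = *$, conditioned on $X^{(e)}(1, j) = 1$, we have $|Y^{(e)}(:, j)|_1 = h(d + \psi_j)$, where $\psi_j \sim B(r - 1, 1/2)$. Thus the Chernoff bound implies that for some $\delta_r > 0$ with $\delta_r = o(r)$, w.h.p.

$$|Y^{(e)}(:, j)|_1 \in \left[h\left(\tfrac{r}{2} - \delta_r\right), h\left(\tfrac{r}{2} + \delta_r\right)\right]. \qquad (67)$$

Let $A$ denote the interval $[\frac{r}{2} - \delta_r, \frac{r}{2} + \delta_r]$. Then by observing that $|Y^{(e)}(:, j)|_1$ is a multiple of $h$, we see that

$$\begin{aligned}
p_1 &= \sum_{m=0}^{r} \Pr\big[X^{(e)}(1, j) = 1,\ |Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \\
&\overset{(a)}{=} \sum_{m=0}^{r} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big] \\
&\overset{(b)}{=} \sum_{m \in A} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big] + o(1) \\
&\overset{(c)}{\sim} \sum_{m \in A} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big], \qquad (68)
\end{aligned}$$

where (a) is true because of the Markov relation

$$X^{(e)}(1, j) \to |Y^{(e)}(:, j)|_1 \to \{j_{maj}(Y^{(e)}) = j\},$$

(b) follows due to (67), and (c) is true since $p_1 \ge 1/2$. Similarly we obtain

$$p_0 = \sum_{m \in A} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 0 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big] + o(1), \qquad (69)$$

implying

$$\text{ratio\_mean} := \frac{P_e^{maj}(Y^{(e)})}{1 - P_e^{maj}(Y^{(e)})} = \frac{p_0}{p_1} \sim \frac{\sum_{m \in A} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 0 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big] + o(1)}{\sum_{m \in A} \Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, j_{maj}(Y^{(e)}) = j\big] \cdot \Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big]}. \qquad (70)$$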

We now have

$$\text{ratio\_mean} \ge \text{ratio\_min} := \min_{m \in A} \text{ratio}(m) + o(1), \qquad (71)$$

where

$$\text{ratio}(m) := \frac{\Pr\big[X^{(e)}(1, j) = 0 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big]}{\Pr\big[X^{(e)}(1, j) = 1 \,\big|\, |Y^{(e)}(:, j)|_1 = hm\big]} \overset{(c)}{=} \frac{\Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, X^{(e)}(1, j) = 0\big]}{\Pr\big[|Y^{(e)}(:, j)|_1 = hm \,\big|\, X^{(e)}(1, j) = 1\big]}, \qquad (72)$$

where (c) follows due to the Bayes' expansion. We observe that conditioned on $X^{(e)}(1, j) = 0$, $|Y^{(e)}(:, j)|_1 = h\psi_j$, where $\psi_j \sim B(r - 1, 1/2)$; and we have already seen that conditioned on $X^{(e)}(1, j) = 1$, $|Y^{(e)}(:, j)|_1 = h(d + \psi_j)$. Thus, for $m \in A$,

$$\begin{aligned}
\text{ratio}(m) &= \frac{\binom{r-1}{m}}{\binom{r-1}{m-d}} = \frac{(m - d)!\,(r - 1 - m + d)!}{m!\,(r - 1 - m)!} \\
&\overset{(d)}{\sim} \sqrt{\frac{(m - d)(r - 1 - m + d)}{m(r - 1 - m)}} \cdot \frac{(m - d)^{m-d} (r - 1 - m + d)^{r-1-m+d}}{m^m (r - 1 - m)^{r-1-m}} \\
&\overset{(e)}{\sim} \frac{(m - d)^{m-d} (r - 1 - m + d)^{r-1-m+d}}{m^m (r - 1 - m)^{r-1-m}} \\
&= \left(\frac{m - d}{m}\right)^{m} \left(\frac{r - 1 - m + d}{r - 1 - m}\right)^{r-1-m} \left(\frac{r - 1 - m + d}{m - d}\right)^{d} \qquad (73) \\
&\overset{(f)}{\to} 1, \qquad (74)
\end{aligned}$$

where (d) is due to the Stirling's approximation $n! = \sqrt{2\pi n}\,(n/e)^n \left(1 + \frac{1}{12n} + O\left(\frac{1}{n^2}\right)\right)$ (see [25, p. 434]), (e) is true since for $m \in A$, $\sqrt{\frac{(m - d)(r - 1 - m + d)}{m(r - 1 - m)}} \to 1$, and (f) follows since for $m \in A$ each of the terms in (73) approaches 1, since $d$ is a constant. Thus combining (70), (71), (72) and (74) implies that

$$\frac{P_e^{maj}(Y^{(e)})}{1 - P_e^{maj}(Y^{(e)})} \ge 1 - o(1),$$

implying

$$P_e^{maj}(Y^{(e)}) \ge 1/2 - o(1).$$

But we have already observed at the beginning of this proof that $p_0 \le 1/2$. Thus

$$P_e^{maj}(Y^{(e)}) = 1/2 - o(1).$$

Conditioned on ($C_2$): A very similar set of steps proves the lemma when the event ($C_2$) occurs. The main difference is that in this case $d = O(\log n)$, whereas it is a constant in the previous case. But (74) is still valid and hence we have $P_e^{maj}(Y^{(e)}) = 1/2 - o(1)$.
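To see numerically that the ratio $\binom{r-1}{m}/\binom{r-1}{m-d}$ approaches 1 for $m$ near $r/2$ and a constant $d$, one can evaluate it exactly for growing $r$. The Python sketch below does so; the function name and the chosen values of $r$, $m$ and $d$ are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_ratio(r: int, m: int, d: int) -> Fraction:
    """Exact value of C(r-1, m) / C(r-1, m-d), the ratio studied in (72)-(74)."""
    return Fraction(comb(r - 1, m), comb(r - 1, m - d))

if __name__ == "__main__":
    d = 2   # hypothetical constant offset
    for r in (101, 1001, 10001):
        m = (r - 1) // 2   # a point in the middle of the interval A
        print(r, float(binomial_ratio(r, m, d)))
```

For fixed $d$ the deviation from 1 shrinks roughly like $1/r$ at the midpoint, which is consistent with the limit in (74).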

M. Moderate deviation for binomial distribution 

To prove Lemma 12 and Lemma [s] we need the following theorem. Suppose $Q(t)$ denotes the upper tail of a standard normal distribution, i.e., $Q(t) := \frac{1}{\sqrt{2\pi}} \int_t^{\infty} e^{-x^2/2}\,dx$.

Theorem 5 (Moderate deviations for binomial): Suppose $X_n \sim B(n, p_n)$. If $t_n \to \infty$ in such a way that $t_n^6 = o(\mathrm{Var}(X_n)) = o(n p_n (1 - p_n))$, then

\[
\Pr\left[X_n > n p_n + t_n \sqrt{n p_n (1 - p_n)}\right] \sim Q(t_n).
\]

The above theorem is an adaptation of a theorem about moderate deviations of binomials when $p_n$ is a constant [26, p. 193]. The proof is very similar to the one presented in [26] for the constant probability case, and is omitted here.
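
The statement of Theorem 5 can be illustrated numerically. The sketch below compares the exact binomial upper tail with $Q(t)$ for one choice of $(n, p, t)$. The helper names `q_function`, `binomial_upper_tail` and `tail_ratio` are hypothetical, and the parameter values are chosen only so that $t$ is small relative to the variance; a single finite example illustrates the asymptotic statement but does not prove it.

```python
from math import erfc, exp, floor, lgamma, log, log1p, sqrt


def q_function(t: float) -> float:
    """Upper tail of the standard normal distribution, Q(t)."""
    return 0.5 * erfc(t / sqrt(2.0))


def binomial_upper_tail(n: int, p: float, x: float) -> float:
    """Exact Pr[X > x] for X ~ B(n, p) with 0 < p < 1, summed via log-pmf values."""
    total = 0.0
    for k in range(max(floor(x) + 1, 0), n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log1p(-p))
        total += exp(log_pmf)
    return total


def tail_ratio(n: int, p: float, t: float) -> float:
    """Pr[X_n > n p + t sqrt(n p (1 - p))] divided by Q(t)."""
    threshold = n * p + t * sqrt(n * p * (1.0 - p))
    return binomial_upper_tail(n, p, threshold) / q_function(t)


if __name__ == "__main__":
    # Here Var(X_n) = 25000, so t = 2 is small compared with the variance.
    print(tail_ratio(10**5, 0.5, 2.0))
```

A ratio close to 1 is what the theorem predicts in the regime where $t_n$ grows slowly compared with $\mathrm{Var}(X_n)$.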

N. Hyper-geometric tails

Definition 2 (Hyper-geometric distribution): A random variable $X$ has hyper-geometric distribution with parameters $(N, m, n)$ if

\[
h(N, m, n, k) := \Pr[X = k] = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} \quad \text{for } k = 0, 1, 2, \ldots, n.
\]

It describes the number of successes in a sequence of $n$ draws, without replacement, from a finite population of size $N$ containing $m$ successes.
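
A minimal sketch of the definition in exact rational arithmetic is given below. The function name `hypergeom_pmf` and the example parameters are illustrative. It checks that the probabilities sum to one, that the mean equals $nm/N$, and that swapping successes with failures mirrors the distribution.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(N: int, m: int, n: int, k: int) -> Fraction:
    """h(N, m, n, k): probability of k successes in n draws without replacement
    from a population of N items, m of which are successes (m <= N, n <= N)."""
    if k < 0 or k > n:
        return Fraction(0)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes get probability 0 automatically.
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


if __name__ == "__main__":
    N, m, n = 50, 20, 10
    pmf = [hypergeom_pmf(N, m, n, k) for k in range(n + 1)]
    print("total probability:", sum(pmf))
    print("mean:", sum(k * q for k, q in enumerate(pmf)), "vs n*m/N =", Fraction(n * m, N))
    # Counting failures instead of successes mirrors the distribution.
    print(all(pmf[k] == hypergeom_pmf(N, N - m, n, n - k) for k in range(n + 1)))
```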

We have the following bound for the tail of a hyper-geometric distribution, due to Chvátal [27].

Lemma 16 (Hyper-geometric tail, Chvátal [27]): Suppose a random variable $X$ has hyper-geometric distribution with parameters $(N, m, n)$. Define $p := m/N$. Then for $t \ge 0$ we have the following bound on the upper tail of $X$:

\[
\Pr[X \ge (p + t)n] \le \left( \left(\frac{p}{p+t}\right)^{p+t} \left(\frac{1-p}{1-p-t}\right)^{1-p-t} \right)^{n}.
\]
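
The following sketch compares the exact upper tail with the standard form of Chvátal's bound, $\left((p/(p+t))^{p+t}\,((1-p)/(1-p-t))^{1-p-t}\right)^n$. The function names are hypothetical, and the threshold is passed as an integer count `k0` (so that $t = k_0/n - p$) to avoid floating-point rounding at the boundary; that parameterization is a choice made here, not something taken from the text.

```python
from math import comb


def upper_tail(N: int, m: int, n: int, k0: int) -> float:
    """Exact Pr[X >= k0] for a hyper-geometric X with parameters (N, m, n)."""
    favourable = sum(comb(m, k) * comb(N - m, n - k) for k in range(k0, n + 1))
    return favourable / comb(N, n)


def chvatal_bound(N: int, m: int, n: int, k0: int) -> float:
    """Upper bound on Pr[X >= (p + t) n], with p = m/N and t = k0/n - p."""
    p = m / N
    t = k0 / n - p
    if not 0.0 <= t < 1.0 - p:
        raise ValueError("need 0 <= t < 1 - p")
    return ((p / (p + t)) ** (p + t)
            * ((1.0 - p) / (1.0 - p - t)) ** (1.0 - p - t)) ** n


if __name__ == "__main__":
    N, m, n = 50, 20, 10  # E[X] = n*m/N = 4
    for k0 in range(5, 10):
        print(k0, upper_tail(N, m, n, k0), chvatal_bound(N, m, n, k0))
```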

For a hyper-geometric random variable $X$, we have $E[X] = pn$. By observing the symmetry $h(N, m, n, k) = h(N, N-m, n, n-k)$, we obtain the following symmetric bound for the lower tail of $X$.

Corollary 2 (The lower tail): Suppose a random variable $X$ has hyper-geometric distribution with parameters $(N, m, n)$. Define $p := m/N$. Then for $t \ge 0$,

\[
\Pr[X \le (p - t)n] \le \left( \left(\frac{p}{p-t}\right)^{p-t} \left(\frac{1-p}{1-p+t}\right)^{1-p+t} \right)^{n}.
\]

The following is a consequence of Lemma 16 and Corollary 2.



Corollary 3 (Simple tail bound): Suppose a random variable $X$ has hyper-geometric distribution with parameters $(N, m, n)$. Define $p := m/N$. Then

\[
\Pr\left[E[X](1 - \delta) < X < E[X](1 + \delta)\right] \ge 1 - 2e^{-E[X]\delta^2/3}.
\]
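
A quick numerical illustration of the two-sided bound, read as $\Pr[(1-\delta)E[X] < X < (1+\delta)E[X]] \ge 1 - 2e^{-E[X]\delta^2/3}$, is sketched below. The function names and the parameter values are assumptions made for the example, and the values of $\delta$ are taken in $(0, 1)$.

```python
from math import comb, exp


def prob_within(N: int, m: int, n: int, delta: float) -> float:
    """Exact Pr[(1 - delta) E[X] < X < (1 + delta) E[X]] for hyper-geometric X."""
    mean = n * m / N
    lo, hi = (1.0 - delta) * mean, (1.0 + delta) * mean
    favourable = sum(comb(m, k) * comb(N - m, n - k)
                     for k in range(n + 1) if lo < k < hi)
    return favourable / comb(N, n)


def simple_lower_bound(N: int, m: int, n: int, delta: float) -> float:
    """The quantity 1 - 2 exp(-E[X] delta^2 / 3)."""
    return 1.0 - 2.0 * exp(-(n * m / N) * delta**2 / 3.0)


if __name__ == "__main__":
    N, m, n = 1000, 400, 100  # E[X] = 40
    for delta in (0.25, 0.5, 0.75):
        print(delta, prob_within(N, m, n, delta), simple_lower_bound(N, m, n, delta))
```

For small $E[X]\delta^2$ the right-hand side is negative, so the bound is informative only when $E[X]\delta^2$ is reasonably large.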



We also need the following version of Lemma 14 for hyper-geometric random variables. The proof is exactly the same as that of Lemma 14, and we refer to [23, p. 23] for the details.

Corollary 4 (Tail of a hyper-geometric r.v.): Suppose X is a hyper-geometric random variable with parameters 

$(N, m, n)$, so that $E[X] = nm/N$. For $t > 2eE[X]$, we have

\[
\Pr[X > t] \le 2^{-t}.
\]
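
The bound of Corollary 4 can be checked on a small instance as follows. This is a sketch with illustrative parameters, chosen so that $E[X] = 1$ and every integer $t \ge 6$ satisfies $t > 2eE[X]$; the function name `tail_above` is hypothetical.

```python
from math import comb, e


def tail_above(N: int, m: int, n: int, t: float) -> float:
    """Exact Pr[X > t] for a hyper-geometric X with parameters (N, m, n)."""
    favourable = sum(comb(m, k) * comb(N - m, n - k)
                     for k in range(n + 1) if k > t)
    return favourable / comb(N, n)


if __name__ == "__main__":
    N, m, n = 1000, 50, 20
    mean = n * m / N  # E[X] = 1
    for t in range(6, n + 1):
        assert t > 2 * e * mean  # the corollary's condition on t
        print(t, tail_above(N, m, n, t), 2.0 ** (-t))
```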